Thursday, March 05, 2009

The Evils of Encoding Meaning Into Data

About a year ago I worked on collating and transforming data from an application so that it could be imported into another app. I've performed this exercise a number of times in the past 20 or so years, and while it is never boring, it is sometimes quite challenging.

Oft times when trying to make the data suitable for inclusion in the new applications, I ask my self "What were they thinking?"

I will leave the answer to that up to your imagination, as my answers to that particular question are not always complimentary.

One of the problems run into is when the data modeler and database designer chose to allow data with encoded special meanings.

In other words, there is key data in the database, the meaning of which can only be ascertained by asking someone that knows, or finding it in the documentation (Is there documentation?)

The example I will use is a real one I ran into.

Given a table containing projects, and revisions of those projects that are indicated as such by a revision "number".

A sensible series of revision numbers might be a numeric range beginning with 1, or even an alpha range beginning with 'A', or even a combination thereof.

Personally, I prefer the unambiguous nature of an ascending numeric range. 2 is always greater than 1. There is no case involved as there would be with an alpha range.

Which is greater, "a1" or "A1"? You know how this will sort in the database, but will it be what the users expect?

While a numeric range would have been preferable, the data I was given to work with used a range of revision "numbers" that was numeric and alpha, with a numeric value being regarding as greater than the "numeric" value. The "numeric" is in quotes, as this obviously must be stored as a string, and in this case with a leading zero.

Given this rule, a project with revisions of A,B,C,D,01,02 would have a most recent revision of "02". This is not the way it works in the standard collating order in the database.

11:34:03 SQL> create table rev (version varchar2(2));

Table created.

11:34:03 SQL>
11:34:03 SQL> insert into rev values('A');
11:34:03 SQL> insert into rev values('B');
11:34:03 SQL> insert into rev values('C');
11:34:03 SQL> insert into rev values('01');
11:34:03 SQL> insert into rev values('02');
11:34:03 SQL>
11:34:03 SQL> select * from rev order by version;

VE
--
01
02
A
B
C

5 rows selected.

In a perfect world, the data modeler or data architect would work with the users to create a reasonable versioning method. In this case however there is no choice but to work with what I was given.

From the data provided, only the most recent version was to be included in the imported data.
With a versioning system that doesn't follow the normal collating order, this requires some code to determine what is really the greatest version.

If you know a method to accomplish this in straight SQL, please feel free to post it. I could not think of a pure SQL solution.

The following generic function was created to determine the greatest revision value given a table name and the key columns.

-- maxrev.sql
-- return the maximum numeric revision
-- returna alpha if no numeric exists
-- maximum rev is numeric
-- revs are alpha - mixed number and character

create or replace function maxrev
(
table_name_in varchar2,
key_column_in varchar2,
value_column_in varchar2,
key_value_in varchar2
)
return varchar2
as
v_max_rev integer := 0;
v_tmp_rev integer;
v_col_rev varchar2(10);
v_return_rev varchar2(30);
type curType is ref cursor;
l_cursor curType;
v_sql varchar2(1000);

function is_number( chk_data_in varchar2 )
return boolean
is
dummy number(38,4);
begin
dummy := to_number(chk_data_in);
return true;
exception
when value_error then
return false;
when others then
raise;
end;

begin
-- get the maximum rev, whether alpha or numeric
-- there may not be any numerics
v_sql := 'select max(' || value_column_in || ') from ' || table_name_in || ' where ' || key_column_in || ' = :1';

execute immediate v_sql into v_return_rev using key_value_in;

--return v_return_rev;

v_sql := 'select ' || value_column_in || ' from ' || table_name_in || ' where ' || key_column_in || ' = :1';

open l_cursor for v_sql using key_value_in;
loop
fetch l_cursor into v_col_rev;
exit when l_cursor%notfound;

if (is_number(v_col_rev)) then
v_tmp_rev := to_number(v_col_rev);
end if;
if (v_tmp_rev > v_max_rev) then
v_max_rev := v_tmp_rev;
v_return_rev := v_col_rev;
end if;
end loop;

return v_return_rev;
end;
/

( Reviewing this function just now, I see what could be considered a programming error.
Let me know if you spot it. )

Here's a test case to prove that the function works as expected.

-- maxrev_test.sql
-- should always return numeric if it exists, otherwise alpha

drop table maxrev_test;

create table maxrev_test ( id varchar2(4), rev varchar2(2));

insert into maxrev_test values('T1', 'A');
insert into maxrev_test values('T1', 'B');
insert into maxrev_test values('T1', '01');
insert into maxrev_test values('T1', '02');
insert into maxrev_test values('T2', '01');
insert into maxrev_test values('T2', '02');
insert into maxrev_test values('T2', '03');
insert into maxrev_test values('T3', 'X');
insert into maxrev_test values('T3', 'Y');
insert into maxrev_test values('T3', 'Z');

commit;

select * from maxrev_test order by id,rev;

col rev format a10

prompt
prompt Incorrect results
prompt

select id, max(rev) rev
from maxrev_test
group by id
order by id
/

prompt
prompt Correct results
prompt

select id, maxrev('maxrev_test','id','rev',id) rev
from maxrev_test
group by id
order by id
/



And the results:

Incorrect results

ID REV
---- ----------
T1 B
T2 03
T3 Z

3 rows selected.

Correct results

ID REV
---- ----------
T1 02
T2 03
T3 Z

3 rows selected.

15 comments:

Narendra said...

Jared,

What version of oracle database are you using?
I could reproduce the expected results with your data using only SQL as follows:

SQL> select id, max(rev) keep (dense_rank last order by flag nulls first) rev
2 from (
3 select id, rev,
4 case when mreg_rev = 0 then Decode(regexp_instr(rev, '^[A-Z]'), 0, to_number(rev))
5 when mreg_rev = 1 then ascii(rev)
6 end flag
7 from (select id, rev, reg_rev, min(reg_rev) over (partition by id) mreg_rev
8 from (select id, rev, regexp_instr(rev, '^[A-Z]') reg_rev from maxrev_test)))
9 group by id ;

ID RE
---- --
T1 02
T2 03
T3 Z


This is based on following assumptions:
1. The version numbers can not start with a digit and also contain alphabets.
2. A "numeric"-only version will always be considered later than an alpha-numeric version.
3. No performace comparison made.

Now, I am a bad programmer so are you talking about any of the following possible "errors":
1. The cursor l_cur is not closed
2. v_tmp_rev is not initialized
3. As the function is generic, should it not check for existance of table & column ?
4. What is the purpose of following code? Shouldn't it be removed ?

-- get the maximum rev, whether alpha or numeric
-- there may not be any numerics
v_sql := 'select max(' || value_column_in || ') from ' || table_name_in || ' where ' || key_column_in || ' = :1';

execute immediate v_sql into v_return_rev using key_value_in;

Narendra said...

Jared,

Forgot to mention that I tested my solution on Oracle 10g.

Jared said...

Narenda, that's a good use of analytics. I did not come up with that solution obviously.

The database was 9i, but I believe this SQL would have worked, I will try it later.

Even with just SQL, that is a lot of code to write to do what would be just a simple max() if reasonable data were used for revision numbers.

Re mistakes, I was referring to the fact that v_tmp_rev was not initialized, as you mentioned.

Regarding the others:

l_cursor: technically should be closed. It's not reused in the code block.

check for existence of table: there was a set list of tables this was used with. If the table didn't exist, I just let the process fail and start over

code to be removed: no, take another look. This gets the max value. The code following it may not execute.

Narendra said...

Jared,

The sql which I suggested will not work on 9i as it uses regular expression support as well as KEEP extension to aggregate functions, which, I guess, was introduced in Oracle 10g.
Also, I agree that that is a lot of code for something that should be just a simple MAX(). So first and most correct choice will always be to ensure that the version number solution allows simple use of MAX() to find latest version. I just thought it was an interesting problem and took it up on myself as a challenge to solve it using only SQL. Also, I know that many times things like being able to use MAX() seems an ideal world wish and we have to live with bad design.
Now about my last observation. I again had a look at the code and could not understand how "following code may not execute". Maybe I am feeling dumb...:)

Jared said...

Admittedly I didn't look closely enough at your SQL to notice the regex.

That could still probably be done in 9i using the owa_pattern package.

As for feeling 'dumb', I see no reason for that after the bit of SQL you posted up. :)

Rather than me trying to explain why the code doesn't work if that one line is removed, you could run the example both with and without it.

It will not work properly without that first select into v_return_rev.

Narendra said...

Jared,

I will try playing with your code.
In the meantime, I was wondering if I had over-complicated my query and if there was a way to simplify it (and also make it compatible with 9i).
Now, in addition to my assumpations earlier, if we can make an assumption that version numbers can not contain more than one alphabet i.e. there will not be version numbers like 'AA', 'AB' etc. and also that version numbers can not contain characters other than alpha-numeric, then I came up with following SQL:

SQL> select id, rev
2 from (select id, rev, reg_rev, max(reg_rev) over (partition by id) max_rev
3 from (select id, rev, case when ascii(rev) between 48 and 57 then to_number(rev) + 100 else as
cii(rev) end reg_rev from maxrev_test))
4 where reg_rev = max_rev ;

ID RE
---- --
T1 02
T2 03
T3 Z

Jared said...

I just plugged that in to the unit test on a 9i database, and it works just as well as the function, only faster.

Clearly I need to try an analytic solution before writing a function.

Narendra said...

Jared,

Good to know your results.
Also, when I played with your function (just for my understanding), I realised that first sql works when an ID has only alphabetical versions.
Well, that confirms I have a long way to go...:)

Brian Camire said...

Would it be feasible to create a table that lists all possible version codes, along with a corresponding (probably unique) ascending version number? It might look something like this:

SQL> CREATE TABLE VERSION_CODES
2 (
3 VERSION_CODE VARCHAR(2) PRIMARY KEY,
4 VERSION_NUMBER INTEGER NOT NULL UNIQUE
5 );
SQL> INSERT INTO VERSION_CODES VALUES('A', 1);
SQL> INSERT INTO VERSION_CODES VALUES('B', 2);
.
.
.
SQL> INSERT INTO VERSION_CODES VALUES('X', 24);
SQL> INSERT INTO VERSION_CODES VALUES('Y', 25);
SQL> INSERT INTO VERSION_CODES VALUES('Z', 26);
SQL> INSERT INTO VERSION_CODES VALUES('01', 27);
SQL> INSERT INTO VERSION_CODES VALUES('02', 28);
SQL> INSERT INTO VERSION_CODES VALUES('03', 29);
.
.
.

You could then join your source table ("maxrev_test" in your example) to this table to get the version number, and then select the row(s) with the maximum version number. It might look something like this:

SQL> SELECT
2 ID,
3 REV
4 FROM
5 (
6 SELECT
7 MAXREV_TEST.*,
8 VERSION_CODES.VERSION_NUMBER,
9 MAX(VERSION_CODES.VERSION_NUMBER)
10 OVER (PARTITION BY MAXREV_TEST.ID) AS MAXIMUM_VERSION_NUMBER
11 FROM
12 MAXREV_TEST,
13 VERSION_CODES
14 WHERE
15 VERSION_CODES.VERSION_CODE = MAXREV_TEST.REV
16 )
17 WHERE
18 VERSION_NUMBER = MAXIMUM_VERSION_NUMBER;

ID RE
---- --
T1 02
T2 03
T3 Z

Jared said...

A table of version codes would be feasible, but not practical in this case.

We received several iterations of data, and the codes are not necessarily consistent.

It would require discovering the set of codes present in each set of data, and then updating the code table.

Using a function eliminates a step.

Also, keep in mind that this is part of a conversion process, not an ongoing application effort.

At least in my mind, the function was the way to go.

Using a table as you suggest would certainly be something to consider for use in an app.

Then there would be need to catch exceptions in the event the codes are not constrained.

Brian Camire said...

If this is a one-time conversion, you have the luxury of discovering the set of codes ahead of time and knowing they will not change, so there should be no need to
"catch exceptions" (provided you have not missed a step).

If the meaning of the codes changes from batch to batch, you could use a hybrid approach, where you use your function (or something equivalent) to re-populate the VERSION_CODES table from the distinct set of codes represented in the batch to be converted. In this way, the VERSION_CODES table serves as a sort of rudimentary function result cache a la 11g.

Still, if you have another approach that does (or has already done) the job, I guess it doesn't matter too much...

Jared said...

I would emphasize the part where using a function guarantees that I don't have to think about what codes would appear in the next batch of data - it would just work.

And, yet it is done.

Stephan said...

select id,
nvl( revnum, rev )
from (

select id,
max( case when replace( translate ( rev, '0123456789', '0' ), '0', '' ) is null then rev end ) revnum,
max( rev ) rev
from maxrev_test
group by id
order by id
)

Jared said...

Stephan,

That isn't quite it.
Correct results would be:

ID REV
---- ----------
T1 02
T2 03
T3 Z

3 rows selected.

Your query returns:

ID RE REV
---- -- ----------
T1 02 B
T2 03 03
T3 Z

3 rows selected.

Modifying the query slightly does work however:

select id, decode(revnum,null, rev, revnum) rev
from (
select id,
max( case when replace( translate ( rev, '0123456789', '0' ), '0', '' ) is null then rev end ) revnum,
max( rev ) rev
from maxrev_test
group by id
)
order by id

Stephan said...

Interesting....

SQL>select id,
2 nvl( revnum, rev )
3 from (
4 select id,
5 max( case when replace( translate ( rev, '0123456789', '0' ), '0', '' ) is
null then rev end ) revnum,
6 max( rev ) rev
7 from maxrev_test
8 group by id
9 order by id
10 )
11 /

ID NV
---- --
T1 02
T2 03
T3 Z