Thursday, March 05, 2009

The Evils of Encoding Meaning Into Data

About a year ago I worked on collating and transforming data from an application so that it could be imported into another app. I've performed this exercise a number of times in the past 20 or so years, and while it is never boring, it is sometimes quite challenging.

Oft times when trying to make the data suitable for inclusion in the new applications, I ask my self "What were they thinking?"

I will leave the answer to that up to your imagination, as my answers to that particular question are not always complimentary.

One of the problems run into is when the data modeler and database designer chose to allow data with encoded special meanings.

In other words, there is key data in the database, the meaning of which can only be ascertained by asking someone that knows, or finding it in the documentation (Is there documentation?)

The example I will use is a real one I ran into.

Given a table containing projects, and revisions of those projects that are indicated as such by a revision "number".

A sensible series of revision numbers might be a numeric range beginning with 1, or even an alpha range beginning with 'A', or even a combination thereof.

Personally, I prefer the unambiguous nature of an ascending numeric range. 2 is always greater than 1. There is no case involved as there would be with an alpha range.

Which is greater, "a1" or "A1"? You know how this will sort in the database, but will it be what the users expect?

While a numeric range would have been preferable, the data I was given to work with used a range of revision "numbers" that was numeric and alpha, with a numeric value being regarding as greater than the "numeric" value. The "numeric" is in quotes, as this obviously must be stored as a string, and in this case with a leading zero.

Given this rule, a project with revisions of A,B,C,D,01,02 would have a most recent revision of "02". This is not the way it works in the standard collating order in the database.

11:34:03 SQL> create table rev (version varchar2(2));

Table created.

11:34:03 SQL>
11:34:03 SQL> insert into rev values('A');
11:34:03 SQL> insert into rev values('B');
11:34:03 SQL> insert into rev values('C');
11:34:03 SQL> insert into rev values('01');
11:34:03 SQL> insert into rev values('02');
11:34:03 SQL>
11:34:03 SQL> select * from rev order by version;

VE
--
01
02
A
B
C

5 rows selected.

In a perfect world, the data modeler or data architect would work with the users to create a reasonable versioning method. In this case however there is no choice but to work with what I was given.

From the data provided, only the most recent version was to be included in the imported data.
With a versioning system that doesn't follow the normal collating order, this requires some code to determine what is really the greatest version.

If you know a method to accomplish this in straight SQL, please feel free to post it. I could not think of a pure SQL solution.

The following generic function was created to determine the greatest revision value given a table name and the key columns.

-- maxrev.sql
-- return the maximum numeric revision
-- returna alpha if no numeric exists
-- maximum rev is numeric
-- revs are alpha - mixed number and character

create or replace function maxrev
(
table_name_in varchar2,
key_column_in varchar2,
value_column_in varchar2,
key_value_in varchar2
)
return varchar2
as
v_max_rev integer := 0;
v_tmp_rev integer;
v_col_rev varchar2(10);
v_return_rev varchar2(30);
type curType is ref cursor;
l_cursor curType;
v_sql varchar2(1000);

function is_number( chk_data_in varchar2 )
return boolean
is
dummy number(38,4);
begin
dummy := to_number(chk_data_in);
return true;
exception
when value_error then
return false;
when others then
raise;
end;

begin
-- get the maximum rev, whether alpha or numeric
-- there may not be any numerics
v_sql := 'select max(' || value_column_in || ') from ' || table_name_in || ' where ' || key_column_in || ' = :1';

execute immediate v_sql into v_return_rev using key_value_in;

--return v_return_rev;

v_sql := 'select ' || value_column_in || ' from ' || table_name_in || ' where ' || key_column_in || ' = :1';

open l_cursor for v_sql using key_value_in;
loop
fetch l_cursor into v_col_rev;
exit when l_cursor%notfound;

if (is_number(v_col_rev)) then
v_tmp_rev := to_number(v_col_rev);
end if;
if (v_tmp_rev > v_max_rev) then
v_max_rev := v_tmp_rev;
v_return_rev := v_col_rev;
end if;
end loop;

return v_return_rev;
end;
/

( Reviewing this function just now, I see what could be considered a programming error.
Let me know if you spot it. )

Here's a test case to prove that the function works as expected.

-- maxrev_test.sql
-- should always return numeric if it exists, otherwise alpha

drop table maxrev_test;

create table maxrev_test ( id varchar2(4), rev varchar2(2));

insert into maxrev_test values('T1', 'A');
insert into maxrev_test values('T1', 'B');
insert into maxrev_test values('T1', '01');
insert into maxrev_test values('T1', '02');
insert into maxrev_test values('T2', '01');
insert into maxrev_test values('T2', '02');
insert into maxrev_test values('T2', '03');
insert into maxrev_test values('T3', 'X');
insert into maxrev_test values('T3', 'Y');
insert into maxrev_test values('T3', 'Z');

commit;

select * from maxrev_test order by id,rev;

col rev format a10

prompt
prompt Incorrect results
prompt

select id, max(rev) rev
from maxrev_test
group by id
order by id
/

prompt
prompt Correct results
prompt

select id, maxrev('maxrev_test','id','rev',id) rev
from maxrev_test
group by id
order by id
/



And the results:

Incorrect results

ID REV
---- ----------
T1 B
T2 03
T3 Z

3 rows selected.

Correct results

ID REV
---- ----------
T1 02
T2 03
T3 Z

3 rows selected.