Thursday, January 31, 2008

Detect numbers with TRANSLATE() - Take two

Last week I wrote about using TRANSLATE to detect numbers in data Using The TRANSLATE() function...

Andrew Clarke at Radio Free Tooting pointed out the shortcomings of using TRANSLATE() to detect numbers.

As I said earlier, all I needed to do was detect if the characters in a string were all digits or not, and I wanted it to be very fast.

But Andrew's remarks got me thinking - could translate be used to detect more complex numbers?

Here's the short list of requirements:

* Detect integers
* Detect numbers with decimal point ( 4.6, 0.2, .7)
* Detect negative and positive ( leading + or - )
* Reject text with more than 1 '.', such as an IP address ( 127.0.0.1 )
* Reject anything with alpha text

And comma's are considered as text. 99,324.1 would be alpha.

If you need to do this on 10g, no problem, as a regular expression can handle it.

Fist create some test data:

drop table number_test;

create table number_test( alphacol varchar2(20));

insert into number_test values('.5');
insert into number_test values('1');
insert into number_test values('2');
insert into number_test values(' 3');
insert into number_test values('4 ');
insert into number_test values('3.14159');
insert into number_test values('127.0.0.1');
insert into number_test values('+34.45');
insert into number_test values('-54.43');
insert into number_test values('this is a test');
insert into number_test values('th1s is 4 t3st');
insert into number_test values('.');
commit;

Now select only columns where the value is a number:

select alphacol
from number_test
where regexp_instr(trim(alphacol),'^[-+]?[0-9]*(\.?[0-9]+)?$') > 0
order by 1

SQL> /

ALPHACOL
--------------------
3
+34.45
-54.43
1
2
3.14159
4

7 rows selected.

That seems to work.

But what if you're stuck doing this on 9i? REGEXP_INSTR is not available.

You can use the user defined function IS_NUMBER(), which works well, but is very slow if used on large amounts of data.

Might we be able to use and abuse the TRANSLATE() function to speed this up? Here's a bit of convoluted SQL that works well on the limited test data:

select alphacol, alpha2
from
(
select alphacol,
-- is there a sign +- ? - remove it
decode(substr(alpha2,1,1),
'-',substr(alpha2,2),
'+',substr(alpha2,2),
alpha2
) alpha2
from (
select
alphacol,
-- remove a single '.' if it/they exists
replace(substr(alphacol,1,instr(alphacol,'.')),'.') || substr(alphacol,instr(alphacol,'.')+1) alpha2
from (
select trim(alphacol) alphacol
from number_test
)
)
)
where substr('||||||||||||||||||||||||||||||||',1,length(alpha2)) = translate(alpha2,'0123456789','||||||||||')
/


(Sorry about formatting - I seem to lose all formatting when I paste SQL)

Output from this nasty bit of SQL is identical to that when using REGEXP_INSTR:

ALPHACOL ALPHA2
-------------------- ----------------------------------------
.5 5
1 1
2 2
3 3
4 4
3.14159 314159
+34.45 3445
-54.43 5443

8 rows selected.

To make the TRANLATE() function do what is needed, a lot of data manipulation had to be done in the SQL. There is so much work being done now that it now takes nearly as long to run as does the IS_NUMBER() function, so there isn't much point in using TRANSLATE().

Runstats results:

SQL> @th5
.047739 secs
.037447 secs
PL/SQL procedure successfully completed.

If nothing else, this was an interesting exercise.

No comments: