Thursday, January 31, 2008

Detect numbers with TRANSLATE() - Take two

Last week I wrote about using TRANSLATE to detect numbers in data Using The TRANSLATE() function...

Andrew Clarke at Radio Free Tooting pointed out the shortcomings of using TRANSLATE() to detect numbers.

As I said earlier, all I needed to do was detect if the characters in a string were all digits or not, and I wanted it to be very fast.

But Andrew's remarks got me thinking - could translate be used to detect more complex numbers?

Here's the short list of requirements:

* Detect integers
* Detect numbers with decimal point ( 4.6, 0.2, .7)
* Detect negative and positive ( leading + or - )
* Reject text with more than 1 '.', such as an IP address ( 127.0.0.1 )
* Reject anything with alpha text

And comma's are considered as text. 99,324.1 would be alpha.

If you need to do this on 10g, no problem, as a regular expression can handle it.

Fist create some test data:

drop table number_test;

create table number_test( alphacol varchar2(20));

insert into number_test values('.5');
insert into number_test values('1');
insert into number_test values('2');
insert into number_test values(' 3');
insert into number_test values('4 ');
insert into number_test values('3.14159');
insert into number_test values('127.0.0.1');
insert into number_test values('+34.45');
insert into number_test values('-54.43');
insert into number_test values('this is a test');
insert into number_test values('th1s is 4 t3st');
insert into number_test values('.');
commit;

Now select only columns where the value is a number:

select alphacol
from number_test
where regexp_instr(trim(alphacol),'^[-+]?[0-9]*(\.?[0-9]+)?$') > 0
order by 1

SQL> /

ALPHACOL
--------------------
3
+34.45
-54.43
1
2
3.14159
4

7 rows selected.

That seems to work.

But what if you're stuck doing this on 9i? REGEXP_INSTR is not available.

You can use the user defined function IS_NUMBER(), which works well, but is very slow if used on large amounts of data.

Might we be able to use and abuse the TRANSLATE() function to speed this up? Here's a bit of convoluted SQL that works well on the limited test data:

select alphacol, alpha2
from
(
select alphacol,
-- is there a sign +- ? - remove it
decode(substr(alpha2,1,1),
'-',substr(alpha2,2),
'+',substr(alpha2,2),
alpha2
) alpha2
from (
select
alphacol,
-- remove a single '.' if it/they exists
replace(substr(alphacol,1,instr(alphacol,'.')),'.') || substr(alphacol,instr(alphacol,'.')+1) alpha2
from (
select trim(alphacol) alphacol
from number_test
)
)
)
where substr('||||||||||||||||||||||||||||||||',1,length(alpha2)) = translate(alpha2,'0123456789','||||||||||')
/


(Sorry about formatting - I seem to lose all formatting when I paste SQL)

Output from this nasty bit of SQL is identical to that when using REGEXP_INSTR:

ALPHACOL ALPHA2
-------------------- ----------------------------------------
.5 5
1 1
2 2
3 3
4 4
3.14159 314159
+34.45 3445
-54.43 5443

8 rows selected.

To make the TRANLATE() function do what is needed, a lot of data manipulation had to be done in the SQL. There is so much work being done now that it now takes nearly as long to run as does the IS_NUMBER() function, so there isn't much point in using TRANSLATE().

Runstats results:

SQL> @th5
.047739 secs
.037447 secs
PL/SQL procedure successfully completed.

If nothing else, this was an interesting exercise.

Sunday, January 27, 2008

RMAN Usage Survey

As part of a presentation I'm preparing, I would like to get an idea of RMAN usage in the Oracle Community.

If you can spare a couple of minutes, please fill out this 10 question multiple choice survey: RMAN Usage Survey

Update:

It seems I made a poor choice of online Survey site.

To get more than 100 responses to this survey, a "Professional" version must be purchased. At a rate of 1 or 2 surveys a year (for me) that is not exactly practical.

Sharing the Survey results also requires the Professional version.

Here's a summary of the results in MS Excel.

RMAN Usage Survey Results


Thank you!

Thursday, January 24, 2008

Using the TRANSLATE() Function to Detect Numeric Data

When I made the first post to this blog, I had several ideas for future entries.
Those seem to have vanished with the wind, as I could not remember any of them.
Google notebook to the rescue - I have started recording blog ideas.
Now I just need more time to write entries here.

Now that I've made my lame excuse for ignoring this for a few weeks, I will move on to the technical content.

We've recently had a requirement to move some data from one application into another. The apps are similar in purpose, but not identical.

One of challenges in doing this is to cleanup the data into a form that our production app can use.

The source of the data has allowed a revision number for part numbers to be an alphanumeric column in a table. As you well might guess, this 'feature' was somewhat abused.

The data for a single part number may contain numerous rows in a table, each with a different revision number. Some of the revision 'numbers' are numeric, some are alpha.

We are only interested in those that are numeric.

For example, consider the following example:


Part# Revision
----- ----------
123 AA
123 AB
123 AC
123 01
123 02
123 03



The parts we need to import into our system are only those with revisions of 01,02 or 03.

The problem was how to exclude non-numeric revisions in a table.

You might consider the following SQL statement adequate

select * from parts where revision in ('01','02','03',...)



The IN operator could be converted to use a table known numeric revisions

select * from parts where revision in (select revision from good_revisions)



That would work, but would require building the table, which I would rather not do.

The data has not all arrived at one fell swoop - the next batch of data could break this method unless the table good_revisions is maintained. Ugh.

Compounding the problem is that the data is not consistent.
The revision could be ' 2' or '2 ' or '02'.

It seemed that this would call for regular expression.
Had I built the migration schemas in 10g database I could have used the REGEXP_INSTR function to find the numeric only revisions.

As the application is still in 9i, I used a 9i database to build the migration schemas.
Exercising my flawless 20/20 hindsight I realized I should have used 10g for those schemas.

Too late to change now, not enough time to move everything.

The next choice is to use the OWA_PATTERN package to find numeric only revisions.

From a purely functional perspective, this works perfectly.

From a 'waiting around staring at the computer while a job finishes' it was less than perfect.
Using OWA_PATTERN practically guarantees that any SQL it is used in will be slow.

There had to be a faster way.

At that point the TRANSLATE function came to mind.
While I have always found this function rather obtuse, it seemed it might do the job.

The TRANSLATE function accepts a string, a list of characters to locate in the string, and a list of characters to translate them to.

How can that be used to detect numeric only character data?

It works by transforming all numeric characters into a known character, and checking to see if the string consists solely of that character.

An example is the best way to show this.

Create a test table:

create table detect_numeric
as
select cast(chr(a.rownumber) || chr(b.rownumber) as varchar2(2)) fake_number
from
( select rownum rownumber from all_Tables where rownum <= 1000 ) a, ( select rownum rownumber from all_Tables where rownum <= 1000 ) b where a.rownumber between 48 and 90 and b.rownumber between 48 and 90 /



Here's an example of using OWA_PATTERN to find numeric data:

SQL> l
1 select fake_number
2 from detect_numeric
3 where owa_pattern.amatch(fake_number,1,'^\d+$') > 0
4* order by 1
SQL> /

FA
--
00
01
...
98
99

100 rows selected.

Here's how to do the same thing using the TRANSLATE function

SQL> l
1 select fake_number
2 from detect_numeric
3 where '||' = translate(fake_number,'0123456789','||||||||||')
4* order by 1
SQL> /

FA
--
00
01
...
98
99

100 rows selected.

The TRANSLATE function works by changing all numeric characters to the '|' symbol.
If the returned value is '||' then this must be all numeric data.

There's a caveat with using this method. The character used in the TRANSLATE function must not appear in the data being checked.

This example is simplified in that it does not account for nulls, spaces or varying data lengths.
Nonetheless it works quite well.

Is it faster?

In the test I ran the TRANSLATE function is 2 orders or magnitude faster than when using OWA_PATTERN.

Tom Kyte's run_stats was used to compare the run times and resource usage of both methods.
Run Stats

Running both methods in loop 20 times yielded the following run times:

10.265529 secs
.010235 secs

PL/SQL procedure successfully completed.

OWA_PATTERN required 10 seconds, while the TRANSLATE function did the same thing in 0.01 seconds.

The resource usage was much better for TRANSLATE as well;

SQL> @run_stats

NAME RUN1 RUN2 DIFF
---------------------------------------- ---------- ---------- ----------
...
LATCH.SQL memory manager workarea list l 268 0 -268
atch

LATCH.checkpoint queue latch 640 0 -640
STAT...redo size 27764 28404 640
STAT...Elapsed Time 1028 3 -1025
STAT...recursive cpu usage 1029 3 -1026
STAT...session pga memory max 74048 65536 -8512
STAT...session uga memory 65464 0 -65464
STAT...session uga memory max 65464 0 -65464
STAT...session pga memory 74048 0 -74048

49 rows selected.

As long as was playing with this it seemed a good idea to test on 10g as well, using the REGEXP_INSTR function rather than the slow OWA_PATTERN package.

The results were surprising.

While REGEXPR_INSTR was very fast, TRANSLATE was still the faster method.

.048662 secs
.008829 secs

PL/SQL procedure successfully completed.

The TRANSLATE function appears to be another order of magnitude faster on 10gR2 as on 9i.

That should not be too big a surprise as there were many PL/SQL optimization included in 10gR2, but this was still somewhat unexpected.

------

Niall Litchfield Niall Litchfield's Blog wondered why I had not tried using an is_number function such as the one shown below.

create or replace function is_number( chk_data_in varchar2 )
return number
is
dummy number(38,4);
begin
dummy := to_number(chk_data_in);
return 1;
exception
when value_error then
return 0;
when others then
raise;
end;

Quite frankly, I forgot about it.

Here are timings for both 9i and 10g. As expected, TRANSLATE is still quite a bit faster.

9i:

SQL> @th3
.092713 secs
.009951 secs

PL/SQL procedure successfully completed.

10gR2:

SQL> @th3
.362097 secs
.008479 secs

PL/SQL procedure successfully completed.