Thursday, August 06, 2009

Detecting Corrupt Data Blocks

Or more accurately, how not to detect corrupt data blocks.

This thread on Oracle-L is regarding lost writes on a database.

One suggestion was made to use the exp utility to export the database, thereby determining if there are corrupt blocks in the database due to disk failure. I didn't give it much thought at first, but fellow Oak Table member Mark Farnham got me thinking about it.

Using exp to detect corrupt blocks, or rather, the absence of corrupt blocks may work, but then again, it may not. It is entirely possible to do a  full table scan on a table successfully, as would happen during an export, even though the blocks on disk have been corrupted.

This can be demonstrated by building a table, ensuring the contents are cached, then destroying the data in the data file, followed by a successful export of the table.

Granted, there are a lot of mitigating factors that could be taken into consideration as to whether or not this would happen in a production database. That's not the point: the point is that it could happen, so exp is not a reliable indicator of the state of the data files on disk.

This test was performed on Oracle EE on RH Linux ES 4. Both are 32 bit.

First create a test tablespace:

create tablespace lost_write datafile '/u01/oradata/dv11/lost_write.dbf' size 1m
extent management local
uniform size 64k

Next the table LOST_WRITE is created in the tablespace of the same name. This will be used to test the assertion that a successful export of the table can be done even though the data on disk is corrupt.

create table lost_write
tablespace lost_write
select * from dba_objects
where rownum <= 1000


select tablespace_name, blocks, bytes
from user_segments
where segment_name = 'LOST_WRITE'

------------------------------ ---------- ----------
LOST_WRITE 16 131072

1 row selected.

Next, do a full table scan and verify that the blocks are cached:

select * from lost_write;

Verify in cache:
select file#,block#,class#, status
from v$bh where ts# = (select ts# from sys.ts$ where name = 'LOST_WRITE')
order by block#

---------- ---------- ---------- -------
40          2         13 xcur
40          3         12 xcur
40          9          8 xcur
40         10          9 xcur
40         11          4 xcur
40         12          1 xcur
40         13          1 xcur
40         14          1 xcur
40         15          1 xcur
40         16          1 xcur
40         17          1 xcur
40         18          1 xcur
40         19          1 xcur
40         20          1 xcur
40         21          1 xcur
40         22          1 xcur
40         23          1 xcur

Now swap the bytes in the file, skipping the first 2 oracle blocks
Caveat: I don't know if that was the correct # of blocks, and I didn't spend any time trying to find out
Also, I belatedly saw that count probably should have been 22 rather than 16, but the results still served the purpose of corrupting the datafile, as we shall see in a bit.

What this dd command is doing is using the same file for both input and output, and rewriting blocks 3-18, swapping each pair of bytes.

dd if=/u01/oradata/dv11/lost_write.dbf of=/u01/oradata/dv11/lost_write.dbf bs=8129 skip=2 count=16 conv=swab,notrunc

The effect is demonstrated by this simple test:

jkstill-19 > echo hello | dd
0+1 records in
0+1 records out
[ /home/jkstill ]

jkstill-19 > echo hello | dd conv=swab
o0+1 records in
0+1 records out

Now we can attempt the export:

exp tables=\(jkstill.lost_write\) ...

Connected to: Oracle Database 10g Enterprise Edition Release - Production
With the Partitioning, Oracle Label Security, Data Mining and Real Application Testing options
Export done in WE8MSWIN1252 character set and AL16UTF16 NCHAR character set

About to export specified tables via Conventional Path ...
. . exporting table LOST_WRITE 1000 rows exported
Export terminated successfully without warnings.
19 > echo hello | dd
0+1 records in
0+1 records out
[ /home/jkstill ]

jkstill-19 > echo hello | dd conv=swab
o0+1 records in
0+1 records out

So, even though the data on disk has been corrupted, the export succeeded. That is due to the table being created with the CACHE option, and all the blocks being cached at the time of export. It may not be necessary to use the CACHE option, but I used it to ensure the test would succeed.

Now let's see what happens when trying to scan the table again. First the NOCACHE option will be set on the table, then a checkpoint.

10:42:45 dv11 SQL> alter table lost_write nocache;

10:43:02 dv11 SQL> alter system checkpoint;

Now try to scan the table again:

10:43:14 - js001292@dv11 SQL> /
select * from lost_write
ERROR at line 1:
ORA-00376: file 40 cannot be read at this time
ORA-01110: data file 40: '/u01/oradata/dv11/lost_write.dbf'

A corollary conclusion can drawn from this example.

If you do discover bad data blocks, you just might be able to do an export of the tables that are in the affected region before doing any recovery. This might be a belt and suspenders approach, but DBA's are not generally being known for taking unnecessary chances when possible data loss is on the line.


Andy Rivenes said...

Hi Jared, you raise a valid point, but assuming that RMAN can perform a valid backup (i.e. verifying that data blocks can be physically read) and then you attempt an export (i.e. verifying that the data can be logically read) I think you're in pretty good shape. dbverfiy could also be used, but it doesn't have the best reputation. I think what you really want is a direct read of all of your data blocks without the worry that corruption will occur during your check. Export can perform direct reads, but there have been problems with it.

Andy Rivenes said...

Hi again, I ran your test on and interestingly enough an export direct exported 593 rows, in my case, and reported no errors for the missing rows. An RMAN backup could not read the file header. The rest of your tests behaved the same. So, it looks like export fails all the way around. I'm wondering if the dd had only touched one block in the middle of the file if the results would have been the same. Unfortunately I don't have the time now to try it out.

Jared said...

Thanks for running more tests Andy. There were several others I could have run as well, but ran out of time also.

For that test, I was just trying to prove that the results found were indeed possible.