Ora? err... Ora ERR... Oracle!

Searching for that elusive solution to the often heard ORA-XXXXX error, and finding interesting tidbits along the way!

Wednesday, November 09, 2005

The Advantages of Patching Part II

This is a second part to "The Advantages of Patching". See the previous entry for more details.

Ok, now that I've gotten the patching issue over with, what's the next step? I'm going to do some testing against the application, of course.


EVENT/ERROR:
Let's see. Running a standard process... yes.. yes... doing good.. should be 20 minutes.. seems a bit longer than usual, but still acceptable... going on to the next process.. oh, that's fine too.. doesn't seem like there's any changes... going on to the next... should be 2 minutes... 5 minutes... waiting... 30 minutes... waiting... waiting... waiting... (after 6 hours) ... still waiting... Oh my. This is NOT GOOD.


[STEPS TO A] SOLUTION (1):
Turning on the feature to get the tracefiles* gives the following contents:

WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64377 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64384 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64414 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64428 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64436 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64441 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=66425 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=67596 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=67751 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=68079 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=70166 p3=1
WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=70380 p3=1
<...>
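As a quick aside, here's a minimal sketch of how a raw tracefile like this is usually produced: extended SQL trace via event 10046 at level 8, which includes the wait events. This assumes you can issue ALTER SESSION in the session running the slow process.

```sql
-- Enable extended SQL trace with wait events (level 8) in the
-- current session; assumes the ALTER SESSION privilege.
ALTER SESSION SET timed_statistics = TRUE;
ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

-- ... run the slow process here ...

-- Turn tracing back off; the tracefile shows up in user_dump_dest.
ALTER SESSION SET EVENTS '10046 trace name context off';
```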


I know this development server's I/O bandwidth is slower compared to the others, but this is ridiculous! I finally got curious about what the tracefile was saying (tkprof doesn't usually help me in cases where one process runs on forever), and it was about time I learned to decipher it myself. Metalink does provide Note 39817.1 for this. Take this little snippet from the raw tracefile:

WAIT #4: nam='db file sequential read' ela= 0 p1=7 p2=64377 p3=1


It is to be read as:

Waiting under cursor#4
for "db file sequential read"
We waited for 0.00 seconds
For a read of File 7 starting at block 64377 for 1 block

So on and so forth. Unfortunately, it seems the process just keeps reading and reading and reading one index (this is actually a single b-tree index on one table), one single block at a time, and has been doing so ever since the whole process (or at least my trace of it) started!
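As a side note, the p1 (file#) and p2 (block#) values from those WAIT lines can be translated into the actual segment being read by querying dba_extents. A sketch, assuming you have access to the DBA views (and be warned, this query itself can be slow on a big database):

```sql
-- Find out which segment owns file 7, block 64377 (the p1/p2
-- values from the WAIT line above); needs access to DBA_EXTENTS.
SELECT owner, segment_name, segment_type
FROM   dba_extents
WHERE  file_id = 7
AND    64377 BETWEEN block_id AND block_id + blocks - 1;
```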


[STEPS TO A] SOLUTION (2):
A bit put off (actually an understatement; by this time I was fuming, because I was again in a helpless state of incredulity), I finally decided to see how fast this same database would run on another, faster development server. I decided to kill two birds with one stone: (a) run the same query on the other server and see if the faster hardware would help with the slow processing time, and (b) see if an unpatched Oracle would do any better.

After much file moving and restoring, I proceeded to run the same SQL query. And the answer came back to me in less than a second.

By now I was suspicious. Even though the other development server is faster, it is not THAT fast. Could it be that the patch somehow skewed something? The patch does include other "compulsory" fixes, among them one for "2363247 - UNNECESSARY BITMAP PATH GENERATION AND BITMAP MEM ALLOCATION", but that is a fix, not a new bug. I read and re-read the patch release notes, but there seemed to be no answer to my question. It also couldn't be the "db preparation" I had done over the past 5 days, since I've done this sort of "db preparation" before and it never affected the execution time this drastically.


[STEPS TO A] SOLUTION (3):
First thing done? I extracted the SQL in chunks (it was a rather longish SQL statement, but nothing that would normally have aroused any suspicion or rung any bells) to pinpoint exactly which part of the SQL statement was slowing the whole process down. The statement was something like this:

SELECT A,B,C
FROM
(
SELECT T1.A,T1.B,T1.C
FROM
(SELECT A,B,C,D FROM TABLE_FLAG WHERE FLAG_COL1 = 'E') T1,
(SELECT A,B,C,D FROM TABLE_MAIN WHERE MAIN_COL1 = 'F'
AND MAIN_COL2 > 0 AND MAIN_COL3 = 2 AND MAIN_COL4 = 3) T2
WHERE T1.A = T2.A AND T1.B = T2.B
AND T1.C = T2.C AND TRUNC(T2.D + NVL(T1.D,0) ) - TRUNC(SYSDATE) = -1

UNION
SELECT T1.A,T1.B,T1.C
FROM
(SELECT A,B,C,D FROM TABLE_FLAG WHERE FLAG_COL1 = 'G') T1,
(SELECT A,B,C,D FROM TABLE_MAIN WHERE MAIN_COL1 = 'F' AND MAIN_COL2 > 0
AND MAIN_COL3 = 2 AND MAIN_COL4 = 3 ) T2
WHERE T1.A = T2.A AND T1.B =T2.B
AND T1.C = T2.C AND TRUNC(T2.D + NVL(T1.D,0) ) - TRUNC(SYSDATE) = -1
)

From there I did the following:

1. Executed the inner SELECT statement for TABLE_FLAG (that was ok)
2. Executed the inner SELECT statement for TABLE_MAIN (that seemed ok too)
3. Executed both inner SELECT statements with the join (hmmmmm... now this is taking a long time)
4. Finally, with the suspect SELECT statement in hand, I ran an explain plan on it. This is what I got on the slow-processing db:


ID OPERATION           OPTIONS        OBJECT_NODE    OBJECT_NAME
-- ------------------- -------------- -------------- ----------------
 0 SELECT STATEMENT
 1 VIEW
 2 SORT                UNIQUE
 3 UNION-ALL
 4 NESTED LOOPS
 5 TABLE ACCESS        FULL                          TABLE_FLAG
 6 TABLE ACCESS        BY INDEX ROWID                TABLE_MAIN
 7 BITMAP CONVERSION   TO ROWIDS
 8 BITMAP AND
 9 BITMAP CONVERSION   FROM ROWIDS
10 INDEX               RANGE SCAN                    TABLE_MAIN_IDX1
11 BITMAP CONVERSION   FROM ROWIDS
12 INDEX               RANGE SCAN                    TABLE_MAIN_IDX2
13 NESTED LOOPS
14 TABLE ACCESS        FULL                          TABLE_FLAG
15 TABLE ACCESS        BY INDEX ROWID                TABLE_MAIN
16 BITMAP CONVERSION   TO ROWIDS
17 BITMAP AND
18 BITMAP CONVERSION   FROM ROWIDS
19 INDEX               RANGE SCAN                    TABLE_MAIN_IDX1
20 BITMAP CONVERSION   FROM ROWIDS
21 INDEX               RANGE SCAN                    TABLE_MAIN_IDX2

And this is what I got from the fast database:

ID OPERATION           OPTIONS        OBJECT_NODE    OBJECT_NAME
-- ------------------- -------------- -------------- ----------------
 0 SELECT STATEMENT
 1 VIEW                               :Q60911003
 2 SORT                UNIQUE         :Q60911003
 3 UNION-ALL                          :Q60911002
 4 HASH JOIN                          :Q60911002
 5 TABLE ACCESS        FULL           :Q60911000     TABLE_FLAG
 6 TABLE ACCESS        FULL           :Q60911002     TABLE_MAIN
 7 HASH JOIN                          :Q60911002
 8 TABLE ACCESS        FULL           :Q60911001     TABLE_FLAG
 9 TABLE ACCESS        FULL           :Q60911002     TABLE_MAIN
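For completeness, both plans above came from a plain EXPLAIN PLAN. A minimal sketch of how that goes on this vintage of Oracle (the statement below is a simplified stand-in for the real one, and PLAN_TABLE is assumed to already exist, e.g. created via utlxplan.sql):

```sql
-- Populate PLAN_TABLE for a statement, then read the plan back.
EXPLAIN PLAN SET STATEMENT_ID = 'SLOWQ' FOR
SELECT t1.a, t1.b, t1.c
FROM   table_flag t1, table_main t2
WHERE  t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c;

SELECT id, operation, options, object_node, object_name
FROM   plan_table
WHERE  statement_id = 'SLOWQ'
ORDER  BY id;
```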

Excuse me??? BITMAP CONVERSIONS?? WHAT THE HECK IS THAT?? I don't even HAVE any bitmap indexes in ANY of my tables!

I googled for bitmap conversions, but it didn't help very much. From what I've read online, these usually occur when bitmap indexes existing on the queried table are used for the access path. Unfortunately, both tables I am querying DO NOT HAVE ANY bitmap indexes at all. It seems the engine is doing this out of sheer whim. BUT. It's still interesting to know that b-tree to bitmap index conversions do happen. I'll need to take note of that and watch out for it in the future.
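Just to be thorough about the "no bitmap indexes" claim, the data dictionary settles it quickly. A sketch (using the table names from the query above):

```sql
-- Confirm there are no bitmap indexes on the two tables involved;
-- INDEX_TYPE would say 'BITMAP' if any existed.
SELECT table_name, index_name, index_type
FROM   user_indexes
WHERE  table_name IN ('TABLE_MAIN', 'TABLE_FLAG')
ORDER  BY table_name, index_name;
```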


[STEPS TO A] SOLUTION (4):
I finally got sick of all this guessing and database juggling. I resolved to find out whether forcing the same SQL statement (on the slower-executing db with the default bitmap conversions) to do a full table scan, instead of using the default access path, would result in the same fast execution time. Not being a db developer, and far from knowing hints like the back of my hand, I googled for them and eventually composed my SQL carefully:

SELECT /*+ FULL (T1, T2)*/ T1.A,T1.B,T1.C
FROM
(SELECT A,B,C,D FROM TABLE_FLAG WHERE FLAG_COL1 = 'E') T1,
(SELECT A,B,C,D FROM TABLE_MAIN WHERE MAIN_COL1 = 'F'
AND MAIN_COL2 > 0 AND MAIN_COL3 = 2 AND MAIN_COL4 = 3) T2
WHERE T1.A = T2.A AND T1.B = T2.B
AND T1.C = T2.C AND TRUNC(T2.D + NVL(T1.D,0) ) - TRUNC(SYSDATE) = -1


Then I executed it, and the results came back to me in less than a minute. How interesting. And the db engine insists on using bitmap conversions (6 hours vs. less than 1 minute)? It must've gone cuckoo along the way.
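For what it's worth, this b-tree-to-bitmap conversion behavior is governed by the hidden parameter _b_tree_bitmap_plans. I haven't tried it here, but in theory setting it to FALSE at the session level should stop the optimizer from considering these plans without touching the SQL at all. As with any underscore parameter, do this only with Oracle Support's blessing:

```sql
-- Hidden parameter controlling whether the optimizer may convert
-- b-tree index rowids into bitmaps; FALSE disables the BITMAP
-- CONVERSION plans. Unsupported unless Oracle Support says so.
ALTER SESSION SET "_b_tree_bitmap_plans" = FALSE;
```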


RESULTS:
By this point, it was down to whether my "5-day preparation for the db" was what affected the explain plan, or the patch itself. My best bet was the patch, as there was a bugfix within the patch that DID touch on bitmap indexes (the release notes said the patch would fix that bug; I would never have thought it could work the other way around). I uninstalled the Oct2005 Critical Patch Update on the original development server I was testing.

I crossed my fingers.

I ran the explain plan.

And the heavens opened up. The plan was what it should be... a FULL TABLE SCAN! :)

After dancing around the office (well, ok, in my head), I tried older patches to see if any of them affected the explain plan for this particular SQL statement in any way. I ended up on April2005 (ok fine, I didn't try the patches after that).

How interesting. I learned about DBMS_REPCAT_RGT, OWA/MODPLSQL, bitmap indexes, bitmap conversions, raw tracefiles, patches, and hints... all just by patching? That seems like a productive experience.

Moral Lesson? There are a couple:

(1) Always remember the first rule of making any changes: do them in a testing/development environment first, and test the full functionality of your application, REGARDLESS of what Oracle says.
(2) Patching is good. Aside from making the database more secure and stable (if it works, anyway), any dirt, rock or boulder that you encounter along the way will also be a good stepping stone to learning something new.


*Acknowledgements to http://www.petefinnigan.com for "how to set trace..."
