It’s been a fun and highly educational week for me. I’d like to say thank you to the folks at NoCOUG for putting together this year’s Training Day seminar. I feel privileged that I could attend this seminar which in my opinion was a definite bang for each buck and minute spent. The seminar was on Advanced Troubleshooting, conducted by none other than Tanel Poder.
A developer had experienced what he felt must definitely be a database issue two days ago. He listed the problem as a transient database problem that must be the db since the app could not be to blame! The developer was insisting that a SQL statement running a SELECT statement was at fault.
What were the sessions doing? What wait events were generated? What SQL was being used? It hadn’t been a day proper of digesting all the info Tanel shared and I got to apply it (albeit at 3am).
Note: You need the Diagnostic Pack license to use the DBA_HIST* tables, if you don’t have a license visit ashmasters.com for a free implementation.
Here’s how the problem was identified:
1) Got a timestamp from the developer
2) Queried V$ACTIVE_SESSION_HISTORY , the data had already aged out so I decided to use the DBA_HIST tables.
3) queried DBA_HIST_SNAPSHOT using the time range the developer provided to get the SNAP_IDs to use, I didn’t want to have to join just to save some typing and the window was small enough to get one SNAP_ID.
select SNAP_ID from DBA_HIST_SNAPSHOT
where BEGIN_INTERVAL_TIME >= to_date('11-10-2009 17:00:00','MM-DD-YYYY HH24:MI:SS') and
END_INTERVAL_TIME <= to_date('11-10-2009 18:30:00','MM-DD-YYYY HH24:MI:SS');
SNAP_ID
-------
281
4) I’ve successfully pushed for develeopers to use DBMS_APPLICATION_INFO to set session MODULE, CLIENT_INFO and ACTION, using this with the SNAP_ID I could then extract the top wait events for my specific sessions using MODULE in the following query:
select EVENT, SUM(TIME_WAITED) from DBA_HIST_ACTIVE_SESS_HISTORY
where SNAP_ID = 281 and MODULE = 'Component 10'
group by event
order by 2 asc, 1 desc;
EVENT SUM(TIME_WAITED)
------------------------------- ----------------
1
buffer busy waits 190
SQL*Net more data to client 3219
enq: TX - index contention 22776
log file sync 60272
db file sequential read 2032567
read by other session 3580430
Of all the events returned I knew that I could ignore the last three since our app produces very high data churn and consequently a lot of redo. This left “enq: TX – index contention” as the next event. This is produced when several sessions are trying to get an ITL entry to UPDATE rows in the same block. Remember, the developer said it was a SELECT that was misbehaving. Since this was an enqueue, it clearly could not be a SELECT statement doing this.
5) I then got more info on the session to see what the sqlid and blocking session were:
select SQL_ID, EVENT, TIME_WAITED,
P1TEXT||': '||to_char(P1) as P1,
P2TEXT||': '||to_char(P2) as P2,
P3TEXT||': '||to_char(P3) as P3,
BLOCKING_SESSION
from DBA_HIST_ACTIVE_SESS_HISTORY
where SNAP_ID = 281 and
MODULE = 'Component 10' and EVENT = 'enq: TX - index contention';
SQL_ID EVENT TIME_WAITED BLOCKING_SESSION
------------- ---------------------------------------- ----------- ----------------
P1
-------------------------
P2
-------------------------
P3
-------------------------
49717m971cn43 enq: TX - index contention 22466 3710
name|mode: 1415053316
usn<<16 | slot: 720925
sequence: 2949230
/* Note the SQL_ID was: 49717m971cn43 */
6) I then decided to look at what segments were being used and had ITL waits greater than zero in the same window:
select OBJ#, sum(ITL_WAITS_TOTAL)
from DBA_HIST_SEG_STAT
where snap_id = 22704
group by obj#
having sum(ITL_WAITS_TOTAL) > 0
order by 2 asc, 1 desc;
OBJ# SUM(ITL_WAITS_TOTAL)
---------- --------------------
280241 1
348216 7
360381 12
360378 18
279246 24
360379 456
7) The very last OBJ# turned out to be the Primary Key Index on a table.
select OBJECT_NAME from DBA_OBJECTS
where OBJECT_ID = 360379;
OBJECT_NAME
------------------
PK_COMPONENT_TAB
8 ) I then looked at the SQL_ID obtained in step 5 above: 49717m971cn43
select SQL_ID, SQL_TEXT from DBA_HIST_SQLTEXT
where SQL_ID = '49717m971cn43';
SQL_ID
---------
49717m971cn43
SQL_TEXT
--------
UPDATE COMPONENT_TAB SET CODEC = -1 WHERE rowid IN (:r0, :r1, :r2, :r3, :r4
, :r5, :r6, :r7, :r8, :r9, :r10, :r11, :r12, :r13, :r14, :r15, :r16, :r17, :r18,
:r19, :r20, :r21, :r22, :r23, :r24, :r25, :r26, :r27, :r28, :r29, :r30, :r31, :
r32, :r33, :r34, :r35, :r36, :r37, :r38, :r39, :r40, :r41, :r42, :r43, :r44,
:r45, :r46, :r47, :r48, :r49, :r50, :r51, :r52, :r53,
.....snipped for brevity ....
:r498, :r499, :r500);
9) I then checked SQL_IDs for other instances of the same component and they were running the exact same SQL.
10) Solution:
- Many instances of the same SQL update statement were running.
- They all had a large range of rows to update and these sets overlapped, naturally these produced enqueues since these updates update the primary key column and hence the index.
- INI_TRANS was set to 5, and we could not raise it to 500, the developer was asked to be less aggressive on the range of rows, or allow each component to update disticnt rows.
NoCOUG / Tanels’ training provided the time and additional insight needed to think things through. Thanks Tanel, thanks NoCOUG and lastly, thank you to my employer for supporting me in my attendance.