技术人生系列 · 我和数据中心的故事——第一期
小y这个名字,是笔者临时想的一个笔名,其实没有什么特殊的含义,就暂且用他来代表我们这些为各个数据中心奉献自己青春的一群默默无闻的IT人吧!
小y今天要和大家分享的是一个疑难杂症的分析过程。如果大家有耐心读完这个案例,一定会或多或少有些收获,也就没浪费小y的一片苦心。
具体来说是一个应用间歇性局部挂起案例的分析过程,报告中将对Oracle数据库稳定运行的共性风险和隐患作出提醒。
据客户反映,应用会间歇性出现异常,包括insert单条记录在内的操作长时间无法完成,按照客户的说法,数据库内可能有“死锁”现象,希望能够找到问题发生的根因,提出解决方案,以避免问题再次发生。
2015年12月23日,问题再次发生,客户再次联系到小y,小y通过远程方式进行了信息收集和故障诊断,最终定位了问题的根本原因。
环境介绍:
操作系统>>>>>>>
INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)
>>>>>>>>
SQL> oradebug setmypid
Statement processed.
SQL> oradebug hanganalyze 3
Hang Analysis in /oracle/admin/xxdb/udump/xxdb_ora_14136.trc
SQL>
SQL> oradebug dump systemstate 266
Statement processed.
SQL> oradebug tracefile_name
/oracle/admin/xxdb/udump/xxdb_ora_14136.trc
>>>>>>>>>>>> PROCESS 19: ---------------------------------------- SO: c00000003949b948, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00 (process) Oracle pid=19, calls cur/top: c0000000397209b0/c0000000397209b0, flag: (0) - int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 121 last post received-location: kcbzww last process to post me: c000000039496148 1 22 last post sent: 0 0 121 last post sent-location: kcbzww last process posted by me: c000000039496148 1 22 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: c000000039529928 O/S info: user: oracle, term: UNKNOWN, ospid: 11880 OSD pid info: Unix process pid: 11880, image: oracle@ap-machine- *** 2015-12-22 10:34:53.431 Short stack dump: ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktugur()+4416<-ktuchg()+1280<-ktbchg2()+704<-kdiins0()+267536<-kdiinsp()+320<-kauxsin()+2960<-insidx()+1744<-insrow()+1440<-insdrv()+960<-inscovexe()+1408<-insExecStmtExecIniEngine()+176<-insexe()+1040<-opiexe()+13776<-kpoal8()+3808<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80 ---------------------------------------- SO: c0000000396d80d8, type: 4, owner: c00000003949b948, flag: INIT/-/-/0x00 (session) sid: 315 trans: c0000000355b2a28, creator: c00000003949b948, flag: (100041) USR/- BSY/-/-/-/-/- DID: 0001-0013-00000027, short-term DID: 0000-0000-00000000 txn branch: 0000000000000000 oct: 2, prv: 0, sql: c0000000384bad50, psql: c0000000384bad50, user: 33/XXDB service name: xxdb O/S info: user: , term: , ospid: 1234, machine: rvwapp2-1 program: waiting for 'db file sequential read' wait_time=0, seconds since wait started=0 file#=c, block#=2f359, blocks=1 blocking sess=0x0000000000000000 seq=42271 Dumping Session Wait History for 'undo segment extension' count=1 wait_time=2 min 45 sec segment#=6, =0, =0 for 'buffer busy waits' count=1 wait_time=0.000009 sec file#=2, block#=59, class#=1b for 'SQL*Net message from client' count=1 wait_time=0.000087 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message to client' count=1 wait_time=0.000001 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message from client' count=1 wait_time=0.000086 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message to client' count=1 wait_time=0.000001 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message from client' count=1 wait_time=0.000086 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message to client' count=1 wait_time=0.000001 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message from client' count=1 wait_time=0.000087 sec driver id=28444553, #bytes=1, =0 for 'SQL*Net message to client' count=1 wait_time=0.000001 sec driver id=28444553, #bytes=1, =0 Sampled Session History of session 315 serial 4544 --------------------------------------------------- The sampled session history is constructed by sampling the target session every 1 second. The sampling process captures at each sample if the session is in a non-idle wait, an idle wait, or not in a wait. If the session is in a non-idle wait then one interval is shown for all the samples the session was in the same non-idle wait. If the session is in an idle wait or not in a wait for consecutive samples then one interval is shown for all the consecutive samples. Though we display these consecutive samples in a single interval the session may NOT be continuously idle or not in a wait (the sampling process does not know). The history is displayed in reverse chronological order. sample interval: 1 sec, max history 120 sec --------------------------------------------------- [120 samples, 10:32:52 - 10:34:53] waited for 'undo segment extension', seq_num: 42270 p1: 'segment#'=0x6 p2: ''=0x0 p3: ''=0x0 time_waited: >= 120 sec (still in wait) --------------------------------------------------- Sampled Session History Summary: longest_non_idle_wait: 'undo segment extension' [120 samples, 10:32:52 - 10:34:53] time_waited: >= 120 sec (still in wait) 可以看到: 进程在被ORADEBUG间接唤醒后,不再等待undo segment extension,而是做” PROCESS 8: ---------------------------------------- SO: c000000039496148, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00 (process) Oracle pid=8, calls cur/top: c00000003971e868/c00000003971e868, flag: (16) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 121 last post received-location: kcbzww last process to post me: c0000000394a0948 211 0 last post sent: 0 0 24 last post sent-location: ksasnd last process posted by me: c000000039495148 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: c000000039529928 O/S info: user: oracle, term: UNKNOWN, ospid: 10072 OSD pid info: Unix process pid: 10072, image: oracle@ap-machine- (SMON) Short stack dump: ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<kernel><-_pw_wait()+48<-pw_wait()+352<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktmmon()+1168<-ktmSmonMain()+64<-ksbrdp()+2368<-opirip()+1184<-opidrv()+1184<-sou2o()+240<-opimai_real()+336<-main()+240<-main_opd_entry()+80 ---------------------------------------- ---------------------------------------- SO: c0000000396ead68, type: 4, owner: c000000039496148, flag: INIT/-/-/0x00 (session) sid: 329 trans: 0000000000000000, creator: c000000039496148, flag: (100051) USR/- BSY/-/-/-/-/- DID: 0001-0008-00000002, short-term DID: 0000-0000-00000000 txn branch: 0000000000000000 oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS service name: SYS$BACKGROUND waiting for 'smon timer' wait_time=0, seconds since wait started=3109 sleep time=12c, failed=0, =0 blocking sess=0x0000000000000000 seq=7382 Dumping Session Wait History for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 11 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 for 'smon timer' count=1 wait_time=4 min 53 sec sleep time=12c, failed=0, =0 Sampled Session History of session 329 serial 1 可以看到,SMON进程在等“SMON TIMER”,即空闲等待。 这说明SMON没有因为阻塞在异常的等待上,导致无法腾出时间来处理前台进程发过来的undo segment扩展请求。 >>>> 操作系统 The problem is a defect in the pw_wait() O/S system call. 即HPUX11.31上,当安装了PHKL_37456这个调度补丁后,操作系统调用pw_wait存在缺陷,导致进程无法从post/wait中被唤醒,解决方案是Solution: on 11.31 install PHKL_38397 or later equivalent. 以下是官网原文 Processes Hang Waiting on 'cursor: pin S wait on X' (and other Wait Events) on HP-UX 11.23 and 11.31 Itanium Systems (Doc ID 580273.1) In this Document APPLIES TO: Oracle Database - Enterprise Edition - Version 9.2.0.1 and later cursor: pin S wait on X (10g wait) kksfbc child completion SGA: allocation forcing component growth There is no blocker process. 0: c0000000003e2ff0 : pw_wait() + 0x30 (/usr/lib/hpux64/libc.so.1) Workaround: oradebug setospid+unlimit to free the hanging process: 1. Solution: on 11.31 install PHKL_38397 or later equivalent. Solution: on 11.23 install PHKL_37809 or later equivalent. Note: the patches whose names begin with PHKL_ are HP-UX patches. Please check with HP-UX support on the latest applicable patch numbers as such patches are regularly updated and superceded. >>>> INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7) 之所以执行不下去,不是因为死锁,而是因为该会话即SID=315的会话,需要执行DML,因此需要UNDO SEGMENT即回滚段来存储前镜像,但发现回滚段空间不足,需要通知SMON后台进程来完成扩展的请求,但是长时间没有获得回滚段扩展成功与否的返回消息。前台进程和SMON之间通过POST/WAIT进制来通讯。 由于操作系统HPUX调度的缺陷,具体来说是pw_wait系统调用(post/wait)存在缺陷,当SMON进程完成回滚段扩展后,消息返回时,前台进程SID=315并没有能被及时调度到CPU上继续处理后续工作,即体现出来就是等待在“undo segment extension“事件上等待,当我们人为使用ORADEBUG对SID 315收集信息并间接唤醒了该前台进程后,前台进行可以继续往下工作,这更加印证了SMON已经将消息已经返回给前台进程,只是由于操作系统调度进制的问题,未能即使将前台进程调度到CPU上,拿到UNDO扩展成功的消息 造成该故障的根本原因是操作系统调度机制的问题,该问题命中操作系统HPUX上的已知缺陷。 The problem is a defect in the pw_wait() O/S system call. 即HPUX11.31上,当安装了PHKL_37456这个调度补丁后,操作系统调用pw_wait存在缺陷,导致进程无法从post/wait中被唤醒,解决方案是Solution: on 11.31 install PHKL_38397 or later equivalent. >>>>3.2 建议 Solution: on 11.31 install PHKL_38397 or later equivalent. 请系统管理员为操作系统安装PHKL_38397补丁。 本文是转载中亦安图的文章。
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:
PHKL_37456 scheduler cumulative patch
HP-UX Itanium
***Checked for relevance on 23-JUL-2013***
The variety in wait events is explained by the fact that the underlying issue is with the pw_wait() HP-UX system call.
Therefore, the above list is not exhaustive, use the pstack output below to match the problem in all cases.
The problem is commonly seen on Parallel Execution Slave processes but can affect any Oracle process.
A command to generate a systemstate, processstate, or errorstack dump e.g. via ALTER SESSION or oradebug command frees the hanging process.
An example of the top of the call stack of an affected process, obtained using pstack:
1: 4000000002e98a20 : pw_wait() + 0x80
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:
PHKL_37456 scheduler cumulative patch
声明:本站所有文章资源内容,如无特殊说明或标注,均为采集网络资源。如若本站内容侵犯了原著者的合法权益,可联系本站删除。