Analyzing a Hung Primary Under PostgreSQL Synchronous Replication
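In a Streaming Replication setup where the PostgreSQL primary is configured for synchronous replication, a backend on the primary hangs when it executes DML while the standby is not running or cannot connect to the primary because of a network problem. The rest of this article traces where that hang occurs.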
1. The Latch data structure
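The Latch struct should be treated as opaque and accessed only through the public functions. It is defined in the header so that Latches can be embedded in larger structs.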
    // In most cases a variable of type int is accessed atomically, and
    // sig_atomic_t can be thought of as an int. Because such a variable must
    // be read or written in a single instruction, sig_atomic_t cannot be a
    // struct; it is always an integer type.
    typedef int __sig_atomic_t;

    /*
     * Latch structure should be treated as opaque and only accessed through
     * the public functions. It is defined here to allow embedding Latches as
     * part of bigger structs.
     */
    typedef struct Latch
    {
        sig_atomic_t is_set;
        bool        is_shared;
        int         owner_pid;
    #ifdef WIN32
        HANDLE      event;
    #endif
    } Latch;
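To make the role of `is_set` concrete, here is a minimal sketch (not PostgreSQL's implementation) of a struct with the same three portable fields whose flag is flipped from a SIGUSR1 handler. `ToyLatch` and every `toy_*` function are hypothetical names made up for this illustration; in PostgreSQL the struct is only touched through functions such as `SetLatch`, `ResetLatch` and `WaitLatch`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in that mirrors the portable fields of Latch. */
typedef struct ToyLatch
{
    volatile sig_atomic_t is_set;   /* written from a signal handler */
    bool        is_shared;
    int         owner_pid;
} ToyLatch;

/* The latch owned by this process; the handler needs a global to find it. */
static ToyLatch *volatile toy_current_latch = NULL;

/* Signal handler: the only thing it does is mark the latch as set. */
static void
toy_sigusr1_handler(int signo)
{
    (void) signo;
    if (toy_current_latch != NULL)
        toy_current_latch->is_set = 1;
}

/* Initialize a process-local latch owned by the calling process. */
static void
toy_latch_init(ToyLatch *latch)
{
    latch->is_set = 0;
    latch->is_shared = false;
    latch->owner_pid = (int) getpid();
}

/* Route SIGUSR1 to the given latch. Returns 0 on success, -1 on failure. */
static int
toy_install_handler(ToyLatch *latch)
{
    struct sigaction sa;

    assert(latch->owner_pid == (int) getpid());
    toy_current_latch = latch;
    sa.sa_handler = toy_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Clear the flag so that the next signal can be observed. */
static void
toy_latch_reset(ToyLatch *latch)
{
    latch->is_set = 0;
}
```

A caller would initialize the latch, install the handler, and then check `is_set` after a SIGUSR1 has been delivered. Because the flag is a `sig_atomic_t`, the handler can write it without any locking.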
2. Trace analysis
Start the primary without starting the standby, connect to the database with psql, and execute a SQL statement. The session hangs:

    testdb=# drop table t1;
Attach gdb to the hung process

    [xdb@localhost ~]$ ps -ef|grep postgres
    xdb       1318     1  0 12:14 pts/0    00:00:00 /appdb/xdb/pg11.2/bin/postgres
    xdb       1319  1318  0 12:14 ?        00:00:00 postgres: logger
    xdb       1321  1318  0 12:14 ?        00:00:00 postgres: checkpointer
    xdb       1322  1318  0 12:14 ?        00:00:00 postgres: background writer
    xdb       1323  1318  0 12:14 ?        00:00:00 postgres: walwriter
    xdb       1324  1318  0 12:14 ?        00:00:00 postgres: autovacuum launcher
    xdb       1325  1318  0 12:14 ?        00:00:00 postgres: archiver
    xdb       1326  1318  0 12:14 ?        00:00:00 postgres: stats collector
    xdb       1327  1318  0 12:14 ?        00:00:00 postgres: logical replication launcher
    xdb       1331  1318  0 12:15 ?        00:00:00 postgres: xdb testdb [local] DROP TABLE waiting for 0/5D07B668
    [xdb@localhost ~]$ gdb -p 1331
    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
    ...
Inspect the call stack

    (gdb) bt
    #0  0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
    #1  0x000000000088e668 in WaitEventSetWaitBlock (set=0x21640e8, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1) at latch.c:1048
    #2  0x000000000088e543 in WaitEventSetWait (set=0x21640e8, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1, wait_event_info=134217761) at latch.c:1000
    #3  0x000000000088dcec in WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217761) at latch.c:385
    #4  0x000000000088dbcd in WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761) at latch.c:339
    #5  0x0000000000863e2d in SyncRepWaitForLSN (lsn=1560786536, commit=true) at syncrep.c:286
    #6  0x0000000000546279 in RecordTransactionCommit () at xact.c:1359
    #7  0x0000000000546da3 in CommitTransaction () at xact.c:2074
    #8  0x0000000000547a3f in CommitTransactionCommand () at xact.c:2817
    #9  0x00000000008be250 in finish_xact_command () at postgres.c:2523
    #10 0x00000000008bbf45 in exec_simple_query (query_string=0x20a1d78 "drop table t1;") at postgres.c:1170
    #11 0x00000000008c0191 in PostgresMain (argc=1, argv=0x20cdcd8, dbname=0x20cdb40 "testdb", username=0x209ea98 "xdb") at postgres.c:4182
    #12 0x000000000081e06c in BackendRun (port=0x20c3b10) at postmaster.c:4361
    #13 0x000000000081d7df in BackendStartup (port=0x20c3b10) at postmaster.c:4033
    #14 0x0000000000819bd9 in ServerLoop () at postmaster.c:1706
    #15 0x000000000081948f in PostmasterMain (argc=1, argv=0x209ca50) at postmaster.c:1379
    #16 0x0000000000742931 in main (argc=1, argv=0x209ca50) at main.c:228
    (gdb)
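The `lsn=1560786536` argument of SyncRepWaitForLSN in frame #5 and the `waiting for 0/5D07B668` text in the ps output are the same WAL position written in two notations: PostgreSQL prints a 64-bit LSN as two 32-bit halves in hexadecimal. The sketch below shows the conversion in both directions. `toy_format_lsn` and `toy_parse_lsn` are hypothetical helpers written for this article, not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 64-bit WAL position as "hi/lo", both halves in upper-case hex. */
static void
toy_format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Parse "hi/lo" back into 64 bits. Returns 0 on success, -1 on bad input. */
static int
toy_parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}
```

Formatting 1560786536 this way gives `0/5D07B668`, so the backend in the stack trace is waiting for exactly the position that the process title reports.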
Kill the process, reconnect, and set a breakpoint on WaitLatch to trace the wait

    #########
    [xdb@localhost ~]$ kill -9 1331
    #########
    testdb=# select pg_backend_pid();
     pg_backend_pid
    ----------------
               1377
    (1 row)
    #########
    [xdb@localhost ~]$ gdb -p 1377
    ...
    (gdb) b WaitLatch
    Breakpoint 1 at 0x88dbac: file latch.c, line 339.
    (gdb)
    #########
    testdb=# drop table t1;
    ERROR:  table "t1" does not exist
    testdb=# create table t1(id int);
The breakpoint is hit

    (gdb) b WaitLatch
    Breakpoint 1 at 0x88dbac: file latch.c, line 339.
    (gdb) c
    Continuing.

    Breakpoint 1, WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761) at latch.c:339
    339     return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout,
    (gdb)
Step into WaitLatchOrSocket

    (gdb) step
    WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217761) at latch.c:359
    359     int ret = 0;
    (gdb)
    (gdb) p *latch
    $1 = {is_set = 0, is_shared = true, owner_pid = 1377}
Build the wait event set

    (gdb) n
    362     WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
    (gdb) n
    364     if (wakeEvents & WL_TIMEOUT)
    (gdb)
    367       timeout = -1;
    (gdb)
    369     if (wakeEvents & WL_LATCH_SET)
    (gdb) p *set
    $2 = {nevents = 0, nevents_space = 3, events = 0x2181eb8, latch = 0x0, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *set->events
    $3 = {pos = 0, events = 0, fd = 0, user_data = 0x0}
    (gdb) p *set->epoll_ret_events
    $4 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb)
    $5 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb) n
    370       AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
    (gdb)
    373     if (wakeEvents & WL_POSTMASTER_DEATH && IsUnderPostmaster)
    (gdb)
    374       AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
    (gdb)
    377     if (wakeEvents & WL_SOCKET_MASK)
    (gdb)
    385     rc = WaitEventSetWait(set, timeout, &event, 1, wait_event_info);
    (gdb) p *set
    $6 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *set->events
    $7 = {pos = 0, events = 1, fd = 11, user_data = 0x0}
    (gdb) p *set->epoll_ret_events
    $8 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb)
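The value `wakeEvents=17` explains which branches were taken above. The sketch below decodes the bitmask, assuming the flag values defined in PostgreSQL 11's `latch.h` (they are redefined here under hypothetical `TOY_` names so that the example compiles on its own; check them against the header for other versions).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Flag values assumed to match PostgreSQL 11's latch.h. */
#define TOY_WL_LATCH_SET         (1 << 0)
#define TOY_WL_SOCKET_READABLE   (1 << 1)
#define TOY_WL_SOCKET_WRITEABLE  (1 << 2)
#define TOY_WL_TIMEOUT           (1 << 3)
#define TOY_WL_POSTMASTER_DEATH  (1 << 4)

/* Write the names of all flags present in wakeEvents, joined by '|'. */
static void
toy_describe_wake_events(int wakeEvents, char *buf, size_t buflen)
{
    static const struct
    {
        int         bit;
        const char *name;
    }           flags[] = {
        {TOY_WL_LATCH_SET, "WL_LATCH_SET"},
        {TOY_WL_SOCKET_READABLE, "WL_SOCKET_READABLE"},
        {TOY_WL_SOCKET_WRITEABLE, "WL_SOCKET_WRITEABLE"},
        {TOY_WL_TIMEOUT, "WL_TIMEOUT"},
        {TOY_WL_POSTMASTER_DEATH, "WL_POSTMASTER_DEATH"},
    };
    size_t      i;

    if (buflen == 0)
        return;
    buf[0] = '\0';
    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
    {
        if ((wakeEvents & flags[i].bit) == 0)
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", buflen - strlen(buf) - 1);
        strncat(buf, flags[i].name, buflen - strlen(buf) - 1);
    }
}
```

Under that assumption 17 is `WL_LATCH_SET|WL_POSTMASTER_DEATH`. This agrees with the trace: WL_TIMEOUT is absent, so the timeout is forced to -1 (wait forever), and exactly two events are added to the set (`nevents` goes from 0 to 2).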
Step into WaitEventSetWait

    (gdb) step
    WaitEventSetWait (set=0x2181e90, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1, wait_event_info=134217761) at latch.c:925
    925     int returned_events = 0;
    (gdb)
Input parameters

    (gdb) n
    928     long cur_timeout = -1;
    (gdb) p *set
    $9 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *occurred_events
    $10 = {pos = 35135068, events = 0, fd = -1772664741, user_data = 0x7ffc96572fa0}
    (gdb)
Run the preliminary checks and set up the wait

    (gdb) n
    930     Assert(nevents > 0);
    (gdb)
    936     if (timeout >= 0)
    (gdb)
    943     pgstat_report_wait_start(wait_event_info);
    (gdb)
    946     waiting = true;
    (gdb)
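The `wait_event_info=134217761` passed to pgstat_report_wait_start packs a wait class into the top byte and an event number into the low 16 bits. The sketch below splits the value; the two masks and the `0x08000000` class constant are assumptions taken from PostgreSQL 11's `pgstat.h` (where that class is `PG_WAIT_IPC`), and the `TOY_`/`toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Masks and class value assumed to match PostgreSQL 11's pgstat.h. */
#define TOY_WAIT_CLASS_MASK  0xFF000000U
#define TOY_WAIT_EVENT_MASK  0x0000FFFFU
#define TOY_PG_WAIT_IPC      0x08000000U

/* Extract the wait class (top byte, left in place). */
static uint32_t
toy_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & TOY_WAIT_CLASS_MASK;
}

/* Extract the event number within its class. */
static uint32_t
toy_wait_event_id(uint32_t wait_event_info)
{
    return wait_event_info & TOY_WAIT_EVENT_MASK;
}
```

134217761 is 0x08000021, so the class is 0x08000000 and the event number is 33. An IPC-class wait is consistent with the backend waiting in SyncRepWaitForLSN for another process to release it.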
No event has occurred yet, so the loop is entered

    951     while (returned_events == 0)
    (gdb)
set->latch->is_set is not true, so the check fails and the loop continues

    982       if (set->latch && set->latch->is_set)
    (gdb) p *set->latch
    $11 = {is_set = 0, is_shared = true, owner_pid = 1377}
    (gdb)
Step into WaitEventSetWaitBlock

    (gdb) n
    1000      rc = WaitEventSetWaitBlock(set, cur_timeout,
    (gdb) step
    WaitEventSetWaitBlock (set=0x2181e90, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1) at latch.c:1042
    1042    int returned_events = 0;
    (gdb)
epoll_wait is called and the process blocks

    (gdb) n
    1048    rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
    (gdb) p *set
    $12 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb)
    (gdb) n
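With a timeout of -1, epoll_wait only returns once one of the registered descriptors becomes ready. The Linux-only sketch below shows one common way a signal can wake such a wait: the handler writes a byte into a pipe whose read end is registered with epoll (the "self-pipe" technique). This is believed to be how PostgreSQL 11 wakes latch waits on Linux, and `fd = 11` in `$7` above would then be the read end of that pipe, but treat both as assumptions. All `toy_*` names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

static int  toy_pipe_fds[2] = {-1, -1};     /* [0] = read end, [1] = write end */
static int  toy_epoll_fd = -1;

/* Signal handler: one byte in the pipe makes the read end readable. */
static void
toy_wakeup_handler(int signo)
{
    int         save_errno = errno;
    ssize_t     rc;

    (void) signo;
    rc = write(toy_pipe_fds[1], "", 1);
    (void) rc;                  /* a full pipe means a wakeup is already pending */
    errno = save_errno;
}

/* Create the pipe, register its read end with epoll, install the handler. */
static int
toy_setup(void)
{
    struct sigaction sa;
    struct epoll_event ev = {0};

    if (pipe(toy_pipe_fds) < 0)
        return -1;
    if (fcntl(toy_pipe_fds[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(toy_pipe_fds[1], F_SETFL, O_NONBLOCK) < 0)
        return -1;

    toy_epoll_fd = epoll_create1(0);
    if (toy_epoll_fd < 0)
        return -1;
    ev.events = EPOLLIN;
    ev.data.fd = toy_pipe_fds[0];
    if (epoll_ctl(toy_epoll_fd, EPOLL_CTL_ADD, toy_pipe_fds[0], &ev) < 0)
        return -1;

    sa.sa_handler = toy_wakeup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Wait for a wakeup. timeout_ms = -1 blocks indefinitely, as in the trace.
 * Returns the number of ready events, 0 on timeout, or -1 on error.
 */
static int
toy_wait(int timeout_ms)
{
    struct epoll_event ev;
    char        drain[16];
    int         rc;

    do
    {
        /* A signal arriving mid-wait yields EINTR; retry with the full timeout. */
        rc = epoll_wait(toy_epoll_fd, &ev, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc > 0)
    {
        /* Empty the pipe so that the next wait blocks again. */
        while (read(toy_pipe_fds[0], drain, sizeof(drain)) > 0)
            ;
    }
    return rc;
}
```

After `toy_setup()`, a call to `toy_wait(-1)` with no SIGUSR1 on the way blocks exactly like the backend in the trace, while a SIGUSR1 delivered to the process makes it return 1.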
Start the standby node

    ####
    [xdb@localhost ~]$ pg_ctl start
    pg_ctl: another server might be running; trying to start server anyway
    ...
The backend receives a signal

    Program received signal SIGUSR1, User defined signal 1.
    0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
    (gdb)
    (gdb) n
    Single stepping until exit from function __epoll_wait_nocancel,
    which has no line number information.
    procsignal_sigusr1_handler (postgres_signal_arg=-1) at procsignal.c:262
    262   {
    (gdb)
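The SIGUSR1 seen here fits the following picture, offered as a reading of the PostgreSQL 11 sources rather than something the trace proves. Once the standby connects and confirms the WAL position, a walsender marks the backend's wait as complete in shared memory and sets the backend's latch, which for a latch owned by another process means sending SIGUSR1. The waiting side resets its latch, rechecks the shared state, and goes back to sleep if nothing changed. The sketch below is a heavily simplified model of that handshake using an anonymous shared mapping and a forked child; every name in it is hypothetical and none of it is PostgreSQL code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { TOY_WAITING = 0, TOY_WAIT_COMPLETE = 1 };

static volatile sig_atomic_t toy_latch_is_set = 0;

/* Signal handler standing in for "set my latch". */
static void
toy_on_sigusr1(int signo)
{
    (void) signo;
    toy_latch_is_set = 1;
}

/*
 * Model one synchronous-commit wait. The parent plays the backend, the
 * forked child plays the releasing process. Returns the final shared state
 * (TOY_WAIT_COMPLETE on success) or -1 if setup fails.
 */
static int
toy_sync_wait_demo(void)
{
    struct sigaction sa;
    sigset_t    block_usr1, old_mask, wait_mask;
    volatile int *state;
    pid_t       releaser;
    int         final_state;

    /* Shared memory visible to both processes after fork(). */
    state = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (state == MAP_FAILED)
        return -1;
    *state = TOY_WAITING;

    sa.sa_handler = toy_on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    /* Keep SIGUSR1 blocked except while sleeping, so no wakeup is lost. */
    sigemptyset(&block_usr1);
    sigaddset(&block_usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block_usr1, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGUSR1);

    releaser = fork();
    if (releaser < 0)
        return -1;
    if (releaser == 0)
    {
        /* Releaser: mark the wait complete, then wake the waiter. */
        usleep(50 * 1000);
        *state = TOY_WAIT_COMPLETE;
        kill(getppid(), SIGUSR1);
        _exit(0);
    }

    for (;;)
    {
        toy_latch_is_set = 0;               /* reset before rechecking */
        if (*state == TOY_WAIT_COMPLETE)
            break;
        while (!toy_latch_is_set)
            sigsuspend(&wait_mask);         /* sleep until a signal arrives */
    }

    waitpid(releaser, NULL, 0);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    final_state = *state;
    munmap((void *) state, sizeof(int));
    return final_state;
}
```

If the child never ran, the parent would stay in `sigsuspend` forever. That is the situation of the backend in this article while no standby is available to confirm the commit.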