怎么理解PostgreSQL中Clock Sweep算法

2025-03-12 技术教程

本篇内容介绍了“怎么理解PostgreSQL中Clock Sweep算法”的有关知识，在实际案例的操作过程中，不少人都会遇到这样的困境，接下来就让小编带领大家学习一下如何处理这些情况吧！希望大家仔细阅读，能够学有所成！

一、数据结构

BufferDesc
共享缓冲区的共享描述符(状态)数据

/**Flagsforbufferdescriptors*buffer描述器标记**Note:TAG_VALIDessentiallymeansthatthereisabufferhashtable*entryassociatedwiththebuffer'stag.*注意:TAG_VALID本质上意味着有一个与缓冲区的标记相关联的缓冲区散列表条目。*///bufferheader锁定#defineBM_LOCKED(1U<<22)/*bufferheaderislocked*///数据需要写入(标记为DIRTY)#defineBM_DIRTY(1U<<23)/*dataneedswriting*///数据是有效的#defineBM_VALID(1U<<24)/*dataisvalid*///已分配buffertag#defineBM_TAG_VALID(1U<<25)/*tagisassigned*///正在R/W#defineBM_IO_IN_PROGRESS(1U<<26)/*readorwriteinprogress*///上一个I/O出现错误#defineBM_IO_ERROR(1U<<27)/*previousI/Ofailed*///开始写则变DIRTY#defineBM_JUST_DIRTIED(1U<<28)/*dirtiedsincewritestarted*///存在等待solepin的其他进程#defineBM_PIN_COUNT_WAITER(1U<<29)/*havewaiterforsolepin*///checkpoint发生,必须刷到磁盘上#defineBM_CHECKPOINT_NEEDED(1U<<30)/*mustwriteforcheckpoint*///持久化buffer(不是unlogged或者初始化fork)#defineBM_PERMANENT(1U<<31)/*permanentbuffer(notunlogged,*orinitfork)*//**BufferDesc--shareddescriptor/statedataforasinglesharedbuffer.*BufferDesc--共享缓冲区的共享描述符(状态)数据**Note:Bufferheaderlock(BM_LOCKEDflag)mustbeheldtoexamineorchange*thetag,stateorwait_backend_pidfields.Ingeneral,bufferheaderlock*isaspinlockwhichiscombinedwithflags,refcountandusagecountinto*singleatomicvariable.Thislayoutallowustodosomeoperationsina*singleatomicoperation,withoutactuallyacquiringandreleasingspinlock;*forinstance,increaseordecreaserefcount.buf_idfieldneverchanges*afterinitialization,sodoesnotneedlocking.freeNextisprotectedby*thebuffer_strategy_locknotbufferheaderlock.TheLWLockcantakecare*ofitself.Thebufferheaderlockis*not*usedtocontrolaccesstothe*datainthebuffer!*注意:必须持有Bufferheader锁(BM_LOCKED标记)才能检查或修改tag/state/wait_backend_pid字段.*通常来说,bufferheaderlock是spinlock,它与标记位/参考计数/使用计数组合到单个原子变量中.*这个布局设计允许我们执行原子操作,而不需要实际获得或者释放spinlock(比如,增加或者减少参考计数).*buf_id字段在初始化后不会出现变化,因此不需要锁定.*freeNext通过buffer_strategy_lock锁而不是bufferheaderlock保护.*LWLock可以很好的处理自己的状态.*务请注意的是:bufferheaderlock不用于控制buffer中的数据访问!**It'sassumedthatnobodychangesthestatefieldwhilebufferheaderlock*isheld.Thusbufferheaderlockholdercandocomplexupdatesofthe*statevariableinsinglewrite,simultaneouslywithlockrelease(cleaning*BM_LOCKEDflag).Ontheotherhand,updatingofstatewithoutholding*bufferheaderlockisrestrictedtoCAS,whichinsurethatBM_LOCKEDflag*isnotset.Atomicincrement/decrement,OR/ANDetc.arenotallowed.*假定在持有bufferheaderlock的情况下,没有人改变状态字段.*持有bufferheaderlock的进程可以执行在单个写操作中执行复杂的状态变量更新,*同步的释放锁(清除BM_LOCKED标记).*换句话说,如果没有持有bufferheaderlock的状态更新,会受限于CAS,*这种情况下确保BM_LOCKED没有被设置.*比如原子的增加/减少(AND/OR)等操作是不允许的.**Anexceptionisthatifwehavethebufferpinned,itstagcan'tchange*underneathus,sowecanexaminethetagwithoutlockingthebufferheader.*Also,inplaceswedoone-timereadsoftheflagswithoutbotheringto*lockthebufferheader;thisisgenerallyforsituationswherewedon't*expecttheflagbitbeingtestedtobechanging.*一种例外情况是如果我们已有bufferpinned,该buffer的tag不能改变(在本进程之下),*因此不需要锁定bufferheader就可以检查tag了.*同时,在执行一次性的flags读取时不需要锁定bufferheader.*这种情况通常用于我们不希望正在测试的flagbit将被改变.**Wecan'tphysicallyremoveitemsfromadiskpageifanotherbackendhas*thebufferpinned.Hence,abackendmayneedtowaitforallotherpins*togoaway.ThisissignaledbystoringitsownPIDinto*wait_backend_pidandsettingflagbitBM_PIN_COUNT_WAITER.Atpresent,*therecanbeonlyonesuchwaiterperbuffer.*如果其他进程有bufferpinned,那么进程不能物理的从磁盘页面中删除items.*因此,后台进程需要等待其他pins清除.这可以通过存储它自己的PID到wait_backend_pid中,*并设置标记位BM_PIN_COUNT_WAITER.*目前,每个缓冲区只能由一个等待进程.**Weusethissamestructforlocalbufferheaders,butthelocksarenot*usedandnotalloftheflagbitsareusefuleither.Toavoidunnecessary*overhead,manipulationsofthestatefieldshouldbedonewithoutactual*atomicoperations(i.e.onlypg_atomic_read_u32()and*pg_atomic_unlocked_write_u32()).*本地缓冲头部使用同样的结构,但并不需要使用locks,而且并不是所有的标记位都使用.*为了避免不必要的负载,状态域的维护不需要实际的原子操作*(比如只有pg_atomic_read_u32()andpg_atomic_unlocked_write_u32())**Becarefultoavoidincreasingthesizeofthestructwhenaddingor*reorderingmembers.Keepingitbelow64bytes(themostcommonCPU*cachelinesize)isfairlyimportantforperformance.*在增加或者记录成员变量时,小心避免增加结构体的大小.*保持结构体大小在64字节内(通常的CPU缓存线大小)对于性能是非常重要的.*/typedefstructBufferDesc{//buffertagBufferTagtag;/*IDofpagecontainedinbuffer*///buffer索引编号(0开始),指向相应的bufferpoolslotintbuf_id;/*buffer'sindexnumber(from0)*//*stateofthetag,containingflags,refcountandusagecount*///tag状态,包括flags/refcount和usagecountpg_atomic_uint32state;//pin-count等待进程IDintwait_backend_pid;/*backendPIDofpin-countwaiter*///空闲链表链中下一个空闲的bufferintfreeNext;/*linkinfreelistchain*///缓冲区内容锁LWLockcontent_lock;/*tolockaccesstobuffercontents*/}BufferDesc;

BufferTag
Buffer tag标记了buffer存储的是磁盘中哪个block

/**Buffertagidentifieswhichdiskblockthebuffercontains.*Buffertag标记了buffer存储的是磁盘中哪个block**Note:theBufferTagdatamustbesufficienttodeterminewheretowritethe*block,withoutreferencetopg_classorpg_tablespaceentries.It's*possiblethatthebackendflushingthebufferdoesn'tevenbelievethe*relationisvisibleyet(itsxactmayhavestartedbeforethexactthat*createdtherel).Thestoragemanagermustbeabletocopeanyway.*注意:BufferTag必须足以确定如何写block而不需要参照pg_class或者pg_tablespace数据字典信息.*有可能后台进程在刷新缓冲区的时候深圳不相信关系是可见的(事务可能在创建rel的事务之前).*存储管理器必须可以处理这些事情.**Note:ifthere'sanypadbytesinthestruct,INIT_BUFFERTAGwillhave*tobefixedtozerothem,sincethisstructisusedasahashkey.*注意:如果在结构体中有填充的字节,INIT_BUFFERTAG必须将它们固定为零，因为这个结构体用作散列键.*/typedefstructbuftag{//物理relation标识符RelFileNodernode;/*physicalrelationidentifier*/ForkNumberforkNum;//相对于relation起始的块号BlockNumberblockNum;/*blknumrelativetobeginofreln*/}BufferTag;二、源码解读

算法思想,在Buffer Manager中已有详细介绍,摘录如下:

WHILEtrue(1)ObtainthecandidatebufferdescriptorpointedbythenextVictimBuffer(2)IFthecandidatedescriptorisunpinnedTHEN(3)IFthecandidatedescriptor'susage_count==0THENBREAKWHILELOOP/*thecorrespondingslotofthisdescriptorisvictimslot.*/ELSEDecreasethecandidatedescriptpor'susage_countby1ENDIFENDIF(4)AdvancenextVictimBuffertothenextoneENDWHILE(5)RETURNbuffer_idofthevictim

算法思想朴素且简单,比起高大上的LRU算法要简单多了.
nextVictimBuffer可视为时钟的时针,把缓冲区视为环形缓冲区,时针循环周而往复的循环,如缓冲区满足unpinned(ref_count == 0) && usage_count == 0的条件,则选中该缓冲,否则,usage_count减一,继续循环,直至找到满足条件的buffer为止.选中的buffer一定是buffers中较少使用的那个.

StrategyGetBuffer中的代码片段:

.../*Nothingonthefreelist,sorunthe"clocksweep"algorithm*///空闲链表中找不到或者满足不了条件,则执行"clocksweep"算法//intNBuffers=1000;trycounter=NBuffers;//尝试次数for(;;){//-------循环//获取buffer描述符buf=GetBufferDescriptor(ClockSweepTick());/**Ifthebufferispinnedorhasanonzerousage_count,wecannotuse*it;decrementtheusage_count(unlesspinned)andkeepscanning.*如果buffer已pinned,或者有一个非零值的usage_count,不能使用这个buffer.*减少usage_count(除非已pinned)继续扫描.*/local_buf_state=LockBufHdr(buf);if(BUF_STATE_GET_REFCOUNT(local_buf_state)==0){//-----refcount==0if(BUF_STATE_GET_USAGECOUNT(local_buf_state)!=0){//usage_count<>0//usage_count-1local_buf_state-=BUF_USAGECOUNT_ONE;//重置尝试次数trycounter=NBuffers;}else{//usage_count=0/*Foundausablebuffer*///发现一个可用的bufferif(strategy!=NULL)//添加到该策略的环形缓冲区中AddBufferToRing(strategy,buf);//输出参数赋值*buf_state=local_buf_state;//返回bufreturnbuf;}}elseif(--trycounter==0){//-----refcount<>0&&--trycounter==0/**We'vescannedallthebufferswithoutmakinganystatechanges,*soallthebuffersarepinned(orwerewhenwelookedatthem).*Wecouldhopethatsomeonewillfreeoneeventually,butit's*probablybettertofailthantoriskgettingstuckinan*infiniteloop.*在没有改变任何状态的情况,我们已经完成了所有buffers的遍历,*因此所有的buffers已pinned(或者在搜索的时候pinned).*我们希望某些进程会周期性的释放buffer,但如果实在拿不到,那报错总比傻傻的死循环要好.*/UnlockBufHdr(buf,local_buf_state);elog(ERROR,"nounpinnedbuffersavailable");}//解锁bufferheaderUnlockBufHdr(buf,local_buf_state);}

ClockSweepTick()函数是StrategyGetBuffer()的辅助过程,移动时针(当前位置往前一格),返回时针指向的buffer.
其实现逻辑如下:

/**ClockSweepTick-HelperroutineforStrategyGetBuffer()*ClockSweepTick-StrategyGetBuffer()的辅助过程**Movetheclockhandonebufferaheadofitscurrentpositionandreturnthe*idofthebuffernowunderthehand.*移动时针(当前位置往前一格),返回时针指向的buffer.*/staticinlineuint32ClockSweepTick(void){uint32victim;/**Atomicallymovehandaheadonebuffer-ifthere'sseveralprocesses*doingthis,thiscanleadtobuffersbeingreturnedslightlyoutof*apparentorder.*原子移动时针一格*-如果有多个进程执行这个操作,这可能会导致缓冲返回的顺序稍微有些混乱.**/victim=pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,1);if(victim>=NBuffers){//--------候选buffer大于NBuffers//记录元素的victimuint32originalVictim=victim;/*alwayswrapwhatwelookupinBufferDescriptors*///回卷BufferDescriptors中查找的内容victim=victim%NBuffers;/**Ifwe'retheonethatjustcausedawraparound,force*completePassestobeincrementedwhileholdingthespinlock.We*needthespinlocksoStrategySyncStart()canreturnaconsistent*valueconsistingofnextVictimBufferandcompletePasses.*如果我们正好导致了wraparound,在持有自旋锁的情况下强制completePasses增加.*之所以需要自旋锁是因为StrategySyncStart()才能返回nextVictimBuffer和completePasses的一致性值.*/if(victim==0){//出现回卷uint32expected;//期望值(不考虑回卷)uint32wrapped;//回卷后的值boolsuccess=false;//成功标记//期望值expected=originalVictim+1;while(!success){/**AcquirethespinlockwhileincreasingcompletePasses.That*allowsotherreaderstoreadnextVictimBufferand*completePassesinaconsistentmannerwhichisrequiredfor*StrategySyncStart().Intheorydelayingtheincrement*couldleadtoanoverflowofnextVictimBuffers,butthat's*highlyunlikelyandwouldn'tbeparticularlyharmful.*在增加completePasses时请求获取自旋锁.*这样可以让其他读取进程以一致性的方式读取nextVictimBuffer和completePasses,*这种一致性的方式在StrategySyncStart()中是需要的.*理论上来说,延迟增加可能会导致nextVictimBuffers溢出,*但但这是非常不可能的，也不会特别有害。*/SpinLockAcquire(&StrategyControl->buffer_strategy_lock);//获取回卷数wrapped=expected%NBuffers;//原子比较并交换数字success=pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,&expected,wrapped);if(success)//如成功,则增加计数StrategyControl->completePasses++;//释放自旋锁SpinLockRelease(&StrategyControl->buffer_strategy_lock);}}}//返回victimreturnvictim;}/**pg_atomic_compare_exchange_u32-CASoperation*pg_atomic_compare_exchange_u32-CAS操作**Atomicallycomparethecurrentvalueofptrwith*expectedandstorenewval*iffptrand*expectedhavethesamevalue.Thecurrentvalueof*ptrwill*alwaysbestoredin*expected.*原子比较ptr当前值与*expected,如果ptr和*expected值一致,则存储newval到ptr中.**ptr的当前值通常会存储在*expected中.**Returntrueifvalueshavebeenexchanged,falseotherwise.*如值已交换,则返回T,否则返回F.**Fullbarriersemantics.*完整的屏障语义.*/staticinlineboolpg_atomic_compare_exchange_u32(volatilepg_atomic_uint32*ptr,uint32*expected,uint32newval){AssertPointerAlignment(ptr,4);AssertPointerAlignment(expected,4);returnpg_atomic_compare_exchange_u32_impl(ptr,expected,newval);}boolpg_atomic_compare_exchange_u32_impl(volatilepg_atomic_uint32*ptr,uint32*expected,uint32newval){boolret;/**Doatomicopunderaspinlock.Itmightlooklikewecouldjustskip*thecmpxchgifthelockisn'tavailable,butthat'djustemulatea*'weak'compareandswap.I.e.onethatallowsspuriousfailures.Since*severalalgorithmsrelyonastrongvariantandthatisefficiently*implementableonmostmajorarchitectureslet'semulateithereas*well.*在自旋锁保护下执行原子操作.*这看起来如果锁不可能的话,我们可以跳过cmpxchg,但这只是模拟了一个"浅"比较和交换.*比如,这会引起spuriousfailures.*由于有几种算法依赖于一种强大的变体，而且这种变体可以在大多数主要架构上有效地实现，*因此我们在这里也对其进行模拟。*/SpinLockAcquire((slock_t*)&ptr->sema);/*performcompare/exchangelogic*///执行比较/交换逻辑ret=ptr->value==*expected;//ptr与*expected是否一致?*expected=ptr->value;//*expected赋值为ptrif(ret)ptr->value=newval;//值一致,则ptr设置为新值/*andreleaselock*///释放自旋锁SpinLockRelease((slock_t*)&ptr->sema);//返回结果returnret;}

“怎么理解PostgreSQL中Clock Sweep算法”的内容就介绍到这里了，感谢大家的阅读。如果想了解更多行业相关的知识可以关注亿速云网站，小编将为大家输出更多高质量的实用文章！