This article analyzes the implementation logic of the functions that ExecHashJoin in PostgreSQL depends on, covering the relevant data structures, a walkthrough of the source code, and a gdb tracing session.

These functions are used in the HJ_NEED_NEW_OUTER phase and include ExecHashJoinOuterGetTuple, ExecPrepHashTableForUnmatched, ExecHashGetBucketAndBatch, ExecHashGetSkewBucket, ExecHashJoinSaveTuple, and ExecFetchSlotMinimalTuple.

1. Data Structures

Plan
All plan nodes "derive" from the Plan structure by having it as their first field. This ensures that everything works when nodes are cast to Plan. (Node pointers are frequently cast to Plan * when passed around generically in the executor.)

/* ----------------
 *      Plan node
 *
 * All plan nodes "derive" from the Plan structure by having the
 * Plan structure as the first field.  This ensures that everything works
 * when nodes are cast to Plan's.  (node pointers are frequently cast to Plan*
 * when passed around generically in the executor)
 *
 * We never actually instantiate any Plan nodes; this is just the common
 * abstract superclass for all Plan-type nodes.
 * ----------------
 */
typedef struct Plan
{
    NodeTag     type;           /* node type */

    /*
     * estimated execution costs for plan (see costsize.c for more info)
     */
    Cost        startup_cost;   /* cost expended before fetching any tuples */
    Cost        total_cost;     /* total cost (assuming all tuples fetched) */

    /*
     * planner's estimate of result size of this plan step
     */
    double      plan_rows;      /* number of rows plan is expected to emit */
    int         plan_width;     /* average row width in bytes */

    /*
     * information needed for parallel query
     */
    bool        parallel_aware; /* engage parallel-aware logic? */
    bool        parallel_safe;  /* OK to use as part of parallel plan? */

    /*
     * Common structural data for all Plan types.
     */
    int         plan_node_id;   /* unique across entire final plan tree */
    List       *targetlist;     /* target list to be computed at this node */
    List       *qual;           /* implicitly-ANDed qual conditions */
    struct Plan *lefttree;      /* input plan tree(s) */
    struct Plan *righttree;
    List       *initPlan;       /* Init Plan nodes (un-correlated expr
                                 * subselects) */

    /*
     * Information for management of parameter-change-driven rescanning
     *
     * extParam includes the paramIDs of all external PARAM_EXEC params
     * affecting this plan node or its children.  setParam params from the
     * node's initPlans are not included, but their extParams are.
     *
     * allParam includes all the extParam paramIDs, plus the IDs of local
     * params that affect the node (i.e., the setParams of its initplans).
     * These are _all_ the PARAM_EXEC params that affect this node.
     */
    Bitmapset  *extParam;
    Bitmapset  *allParam;
} Plan;
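The "derivation" described above is plain C first-member embedding. Below is a minimal, self-contained sketch of the pattern; MiniPlan and MiniSeqScan are made-up stand-ins for illustration, not the real PostgreSQL structs:

#include <stdio.h>

/* Made-up miniature of the first-field "derivation" pattern. */
typedef struct MiniPlan
{
    int     plan_node_id;
    double  total_cost;
} MiniPlan;

typedef struct MiniSeqScan
{
    MiniPlan plan;          /* first field: makes the cast to MiniPlan * legal */
    int      scanrelid;
} MiniSeqScan;

static double
get_total_cost(MiniPlan *p)
{
    return p->total_cost;
}

int
main(void)
{
    MiniSeqScan s = {{1, 164.0}, 1};

    /* generic code can treat any node as a MiniPlan * */
    printf("cost=%.1f\n", get_total_cost((MiniPlan *) &s));
    return 0;
}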

JoinState
The common superclass for the state nodes of Hash/NestLoop/Merge Join plans.

/* ----------------
 *   JoinState information
 *
 *      Superclass for state nodes of join plans.
 * ----------------
 */
typedef struct JoinState
{
    PlanState   ps;             /* its first field is PlanState */
    JoinType    jointype;       /* join type */
    bool        single_match;   /* True if we should skip to next outer tuple
                                 * after finding one inner match */
    ExprState  *joinqual;       /* JOIN quals (in addition to ps.qual) */
} JoinState;

HashJoinState
The run-time state structure for Hash Join.

/* these structs are defined in executor/hashjoin.h: */
typedef struct HashJoinTupleData *HashJoinTuple;
typedef struct HashJoinTableData *HashJoinTable;

typedef struct HashJoinState
{
    JoinState   js;             /* its first field is NodeTag */
    ExprState  *hashclauses;    /* hash join conditions */
    List       *hj_OuterHashKeys;   /* list of ExprState nodes (outer keys) */
    List       *hj_InnerHashKeys;   /* list of ExprState nodes (inner keys) */
    List       *hj_HashOperators;   /* list of operator OIDs */
    HashJoinTable hj_HashTable;     /* the hash table */
    uint32      hj_CurHashValue;    /* current hash value */
    int         hj_CurBucketNo;     /* current regular bucket number */
    int         hj_CurSkewBucketNo; /* current skew bucket number */
    HashJoinTuple hj_CurTuple;      /* current tuple */
    TupleTableSlot *hj_OuterTupleSlot;      /* outer relation slot */
    TupleTableSlot *hj_HashTupleSlot;       /* hash tuple slot */
    TupleTableSlot *hj_NullOuterTupleSlot;  /* dummy outer slot for outer joins */
    TupleTableSlot *hj_NullInnerTupleSlot;  /* dummy inner slot for outer joins */
    TupleTableSlot *hj_FirstOuterTupleSlot; /* first outer tuple, if prefetched */
    int         hj_JoinState;       /* current HJ_* state-machine state */
    bool        hj_MatchedOuter;    /* has the current outer tuple been matched? */
    bool        hj_OuterNotEmpty;   /* is the outer relation non-empty? */
} HashJoinState;

HashJoinTable
The hash table data structure.

typedef struct HashJoinTableData
{
    int         nbuckets;       /* # buckets in the in-memory hash table */
    int         log2_nbuckets;  /* its log2 (nbuckets must be a power of 2) */

    int         nbuckets_original;  /* # buckets when starting the first hash */
    int         nbuckets_optimal;   /* optimal # buckets (per batch) */
    int         log2_nbuckets_optimal; /* log2(nbuckets_optimal) */

    /* buckets[i] is head of list of tuples in i'th in-memory bucket */
    union
    {
        /* unshared array is per-batch storage, as are all the tuples */
        struct HashJoinTupleData **unshared;
        /* shared array is per-query DSA area, as are all the tuples */
        dsa_pointer_atomic *shared;
    }           buckets;

    bool        keepNulls;      /* true to store unmatchable NULL tuples */

    bool        skewEnabled;    /* are we using skew optimization? */
    HashSkewBucket **skewBucket;    /* hashtable of skew buckets */
    int         skewBucketLen;  /* size of skewBucket array (a power of 2!) */
    int         nSkewBuckets;   /* number of active skew buckets */
    int        *skewBucketNums; /* array indexes of active skew buckets */

    int         nbatch;         /* number of batches */
    int         curbatch;       /* current batch #; 0 during 1st pass */

    int         nbatch_original;    /* nbatch when we started inner scan */
    int         nbatch_outstart;    /* nbatch when we started outer scan */

    bool        growEnabled;    /* flag to shut off nbatch increases */

    double      totalTuples;    /* # tuples obtained from inner plan */
    double      partialTuples; /* # tuples obtained from inner plan by me */
    double      skewTuples;     /* # tuples inserted into skew tuples */

    /*
     * These arrays are allocated for the life of the hash join, but only if
     * nbatch > 1.  A file is opened only when we first write a tuple into it
     * (otherwise its pointer remains NULL).  Note that the zero'th array
     * elements never get used, since we will process rather than dump out
     * any tuples of batch zero.
     */
    BufFile   **innerBatchFile; /* buffered virtual temp file per batch */
    BufFile   **outerBatchFile; /* buffered virtual temp file per batch */

    /*
     * Info about the datatype-specific hash functions for the datatypes
     * being hashed.  These are arrays of the same length as the number of
     * hash join clauses (hash keys).
     */
    FmgrInfo   *outer_hashfunctions;    /* lookup data for hash functions */
    FmgrInfo   *inner_hashfunctions;    /* lookup data for hash functions */
    bool       *hashStrict;     /* is each hash join operator strict? */

    Size        spaceUsed;      /* memory space currently used by tuples */
    Size        spaceAllowed;   /* upper limit for space used */
    Size        spacePeak;      /* peak space used */
    Size        spaceUsedSkew;  /* skew hash table's current space usage */
    Size        spaceAllowedSkew;   /* upper limit for skew hashtable */

    MemoryContext hashCxt;      /* context for whole-hash-join storage */
    MemoryContext batchCxt;     /* context for this-batch-only storage */

    /* used for dense allocation of tuples (into linked chunks) */
    HashMemoryChunk chunks;     /* one list for the whole batch */

    /* Shared and private state for Parallel Hash. */
    HashMemoryChunk current_chunk;  /* this backend's current chunk */
    dsa_area   *area;           /* DSA area to allocate memory from */
    ParallelHashJoinState *parallel_state;  /* parallel execution state */
    ParallelHashJoinBatchAccessor *batches; /* parallel accessors */
    dsa_pointer current_chunk_shared;   /* pointer to head of current chunk */
} HashJoinTableData;

typedef struct HashJoinTableData *HashJoinTable;

HashJoinTupleData
The hash join tuple data.

/* ----------------------------------------------------------------
 *              hash-join hash table structures
 *
 * Each active hashjoin has a HashJoinTable control block, which is
 * palloc'd in the executor's per-query context.  All other storage needed
 * for the hashjoin is kept in private memory contexts, two for each hashjoin.
 * This makes it easy and fast to release the storage when we don't need it
 * anymore.  (Exception: data associated with the temp files lives in the
 * per-query context too, since we always call buffile.c in that context.)
 *
 * The hashtable contexts are made children of the per-query context, ensuring
 * that they will be discarded at end of statement even if the join is
 * aborted early by an error.  (Likewise, any temporary files we make will
 * be cleaned up by the virtual file manager in event of an error.)
 *
 * Storage that should live through the entire join is allocated from the
 * "hashCxt", while storage that is only wanted for the current batch is
 * allocated in the "batchCxt".  By resetting the batchCxt at the end of
 * each batch, we free all the per-batch storage reliably and without tedium.
 *
 * During first scan of inner relation, we get its tuples from executor.
 * If nbatch > 1 then tuples that don't belong in first batch get saved
 * into inner-batch temp files.  The same statements apply for the
 * first scan of the outer relation, except we write tuples to outer-batch
 * temp files.  After finishing the first scan, we do the following for
 * each remaining batch:
 *  1. Read tuples from inner batch file, load into hash buckets.
 *  2. Read tuples from outer batch file, match to hash buckets and output.
 *
 * It is possible to increase nbatch on the fly if the in-memory hash table
 * gets too big.  The hash-value-to-batch computation is arranged so that this
 * can only cause a tuple to go into a later batch than previously thought,
 * never into an earlier batch.  When we increase nbatch, we rescan the hash
 * table and dump out any tuples that are now of a later batch to the correct
 * inner batch file.  Subsequently, while reading either inner or outer batch
 * files, we might find tuples that no longer belong to the current batch;
 * if so, we just dump them out to the correct batch file.
 * ----------------------------------------------------------------
 */

/* these are in nodes/execnodes.h: */
/* typedef struct HashJoinTupleData *HashJoinTuple; */
/* typedef struct HashJoinTableData *HashJoinTable; */

typedef struct HashJoinTupleData
{
    /* link to next tuple in same bucket */
    union
    {
        struct HashJoinTupleData *unshared;
        dsa_pointer shared;
    }           next;
    uint32      hashvalue;      /* tuple's hash code */
    /* Tuple data, in MinimalTuple format, follows on a MAXALIGN boundary */
} HashJoinTupleData;

#define HJTUPLE_OVERHEAD  MAXALIGN(sizeof(HashJoinTupleData))
#define HJTUPLE_MINTUPLE(hjtup)  \
    ((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))
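HJTUPLE_OVERHEAD and HJTUPLE_MINTUPLE simply say that each hash table entry is a small header followed, at the next MAXALIGN boundary, by the MinimalTuple payload. The sketch below reproduces that layout arithmetic with simplified stand-ins: it assumes MAXIMUM_ALIGNOF is 8 (typical on 64-bit platforms) and uses a plain pointer instead of the union.

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's MAXALIGN, assuming 8-byte alignment. */
#define MY_MAXALIGN(LEN) (((uintptr_t) (LEN) + 7) & ~((uintptr_t) 7))

typedef struct MyHashJoinTupleData
{
    struct MyHashJoinTupleData *next;   /* plain pointer instead of the union */
    uint32_t    hashvalue;
    /* MinimalTuple data would follow on a MAXALIGN boundary */
} MyHashJoinTupleData;

#define MY_HJTUPLE_OVERHEAD MY_MAXALIGN(sizeof(MyHashJoinTupleData))

int
main(void)
{
    /* on a typical 64-bit build both values print as 16 */
    printf("sizeof=%zu overhead=%zu\n",
           sizeof(MyHashJoinTupleData), (size_t) MY_HJTUPLE_OVERHEAD);
    return 0;
}

2. Source Code Walkthrough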

ExecHashJoinOuterGetTuple
Gets the next outer tuple for a parallel-oblivious hash join: either by executing the outer plan node during the first pass, or by reading it back from one of the hash join's batch temp files.

/* ----------------------------------------------------------------
 *                       The HJ_NEED_NEW_OUTER phase
 * ----------------------------------------------------------------
 */

/*
 * ExecHashJoinOuterGetTuple
 *
 *      get the next outer tuple for a parallel oblivious hashjoin: either by
 *      executing the outer plan node in the first pass, or from the temp
 *      files for the hashjoin batches.
 *
 * Returns a null slot if no more outer tuples (within the current batch).
 *
 * On success, the tuple's hash value is stored at *hashvalue --- this is
 * either originally computed, or re-read from the temp file.
 */
static TupleTableSlot *
ExecHashJoinOuterGetTuple(PlanState *outerNode,     /* outer plan node */
                          HashJoinState *hjstate,   /* hash join exec state */
                          uint32 *hashvalue)        /* returned hash value */
{
    HashJoinTable hashtable = hjstate->hj_HashTable;
    int         curbatch = hashtable->curbatch;
    TupleTableSlot *slot;

    if (curbatch == 0)          /* if it is the first pass */
    {
        /*
         * Check to see if first outer tuple was already fetched by
         * ExecHashJoin() and not used yet.
         */
        slot = hjstate->hj_FirstOuterTupleSlot;
        if (!TupIsNull(slot))
            hjstate->hj_FirstOuterTupleSlot = NULL;     /* reset the slot */
        else
            slot = ExecProcNode(outerNode);     /* else pull the next tuple */

        while (!TupIsNull(slot))
        {
            /*
             * We have to compute the tuple's hash value.
             */
            ExprContext *econtext = hjstate->js.ps.ps_ExprContext;

            econtext->ecxt_outertuple = slot;
            if (ExecHashGetHashValue(hashtable, econtext,
                                     hjstate->hj_OuterHashKeys,
                                     true,      /* outer tuple */
                                     HJ_FILL_OUTER(hjstate),
                                     hashvalue))
            {
                /* remember outer relation is not empty for possible rescan */
                hjstate->hj_OuterNotEmpty = true;

                return slot;    /* return the hashable slot */
            }

            /*
             * That tuple couldn't match because of a NULL, so discard it and
             * continue with the next one.
             */
            slot = ExecProcNode(outerNode);
        }
    }
    else if (curbatch < hashtable->nbatch)      /* not the first batch */
    {
        BufFile    *file = hashtable->outerBatchFile[curbatch];

        /*
         * In outer-join cases, we could get here even though the batch file
         * is empty.
         */
        if (file == NULL)
            return NULL;

        slot = ExecHashJoinGetSavedTuple(hjstate,
                                         file,
                                         hashvalue,
                                         hjstate->hj_OuterTupleSlot);
        if (!TupIsNull(slot))
            return slot;
    }

    /* End of this batch */
    return NULL;
}

/*
 * ExecHashGetHashValue
 *      Compute the hash value for a tuple
 *
 * The tuple to be tested must be in either econtext->ecxt_outertuple or
 * econtext->ecxt_innertuple.  Vars in the hashkeys expressions should have
 * varno either OUTER_VAR or INNER_VAR.
 *
 * A true result means the tuple's hash value has been successfully computed
 * and stored at *hashvalue.  A false result means the tuple cannot match
 * because it contains a null attribute, and hence it should be discarded
 * immediately.  (If keep_nulls is true then false is never returned.)
 */
bool
ExecHashGetHashValue(HashJoinTable hashtable,   /* the hash table */
                     ExprContext *econtext,     /* expression context */
                     List *hashkeys,            /* list of hash key exprs */
                     bool outer_tuple,          /* is this an outer tuple? */
                     bool keep_nulls,           /* keep NULL tuples? */
                     uint32 *hashvalue)         /* returned hash value */
{
    uint32      hashkey = 0;
    FmgrInfo   *hashfunctions;
    ListCell   *hk;
    int         i = 0;
    MemoryContext oldContext;

    /*
     * We reset the eval context each time to reclaim any memory leaked in
     * the hashkey expressions.
     */
    ResetExprContext(econtext);

    oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

    if (outer_tuple)
        hashfunctions = hashtable->outer_hashfunctions;
    else
        hashfunctions = hashtable->inner_hashfunctions;

    foreach(hk, hashkeys)
    {
        ExprState  *keyexpr = (ExprState *) lfirst(hk);
        Datum       keyval;
        bool        isNull;

        /* rotate hashkey left 1 bit at each step */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);

        /*
         * Get the join attribute value of the tuple
         */
        keyval = ExecEvalExpr(keyexpr, econtext, &isNull);

        /*
         * If the attribute is NULL, and the join operator is strict, then
         * this tuple cannot pass the join qual so we can reject it
         * immediately (unless we're scanning the outside of an outer join,
         * in which case we must not reject it).  Otherwise we act like the
         * hashcode of NULL is zero (this will support operators that act
         * like IS NOT DISTINCT, though not any more-random behavior).  We
         * treat the hash support function as strict even if the operator is
         * not.
         *
         * Note: currently, all hashjoinable operators must be strict since
         * the hash index AM assumes that.  However, it takes so little extra
         * code here to allow non-strict that we may as well do it.
         */
        if (isNull)
        {
            if (hashtable->hashStrict[i] && !keep_nulls)
            {
                MemoryContextSwitchTo(oldContext);
                return false;   /* cannot match */
            }
            /* else, leave hashkey unmodified, equivalent to hashcode 0 */
        }
        else
        {
            /* Compute the hash function */
            uint32      hkey;

            hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], keyval));
            hashkey ^= hkey;
        }

        i++;                    /* next hash key */
    }

    MemoryContextSwitchTo(oldContext);

    *hashvalue = hashkey;
    return true;
}
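The key-combining scheme above (rotate the running key left one bit, then XOR in the per-column hash) can be exercised in isolation. In this standalone sketch, hash_one_key() is a dummy stand-in for the datatype-specific hash function that PostgreSQL looks up via outer_hashfunctions/inner_hashfunctions:

#include <stdio.h>
#include <stdint.h>

/* dummy stand-in for the real per-datatype hash function */
static uint32_t
hash_one_key(uint32_t v)
{
    return v * 2654435761u;
}

static uint32_t
combine_hash(const uint32_t *keys, int nkeys)
{
    uint32_t hashkey = 0;

    for (int i = 0; i < nkeys; i++)
    {
        /* rotate hashkey left 1 bit at each step */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
        /* XOR in the hash of the i'th join key column */
        hashkey ^= hash_one_key(keys[i]);
    }
    return hashkey;
}

int
main(void)
{
    uint32_t keys[] = {42, 4711};

    printf("hash=%u\n", combine_hash(keys, 2));
    return 0;
}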

ExecPrepHashTableForUnmatched
Sets up for a series of ExecScanHashTableForUnmatched calls.

/*
 * ExecPrepHashTableForUnmatched
 *      set up for a series of ExecScanHashTableForUnmatched calls
 */
void
ExecPrepHashTableForUnmatched(HashJoinState *hjstate)
{
    /*----------
     * During this scan we use the HashJoinState fields as follows:
     *
     * hj_CurBucketNo: next regular bucket to scan
     * hj_CurSkewBucketNo: next skew bucket (an index into skewBucketNums)
     * hj_CurTuple: last tuple returned, or NULL to start next bucket
     *----------
     */
    hjstate->hj_CurBucketNo = 0;
    hjstate->hj_CurSkewBucketNo = 0;
    hjstate->hj_CurTuple = NULL;
}

ExecHashGetBucketAndBatch
Determines the bucket number and batch number for a hash value.

/*
 * ExecHashGetBucketAndBatch
 *      Determine the bucket number and batch number for a hash value
 *
 * Note: on-the-fly increases of nbatch must not change the bucket number
 * for a given hash code (since we don't move tuples to different hash
 * chains), and must only cause the batch number to remain the same or
 * increase.  Our algorithm is
 *      bucketno = hashvalue MOD nbuckets
 *      batchno = (hashvalue DIV nbuckets) MOD nbatch
 * where nbuckets and nbatch are both expected to be powers of 2, so we can
 * do the computations by shifting and masking.  (This assumes that all hash
 * functions are good about randomizing all their output bits, else we are
 * likely to have very skewed bucket or batch occupancy.)
 *
 * nbuckets and log2_nbuckets may change while nbatch == 1 because of dynamic
 * bucket count growth.  Once we start batching, the value is fixed and does
 * not change over the course of the join (making it possible to compute batch
 * number the way we do here).
 *
 * nbatch is always a power of 2; we increase it only by doubling it.  This
 * effectively adds one more bit to the top of the batchno.
 */
void
ExecHashGetBucketAndBatch(HashJoinTable hashtable,
                          uint32 hashvalue,
                          int *bucketno,
                          int *batchno)
{
    uint32      nbuckets = (uint32) hashtable->nbuckets;
    uint32      nbatch = (uint32) hashtable->nbatch;

    if (nbatch > 1)
    {
        /* we can do MOD by masking, DIV by shifting */
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
    }
    else
    {
        /* only one batch, so batchno is trivially 0 */
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = 0;
    }
}
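Here is a standalone sketch of that arithmetic, using values that appear in the gdb session in section 3 below (nbuckets = 16384, i.e. log2_nbuckets = 14, nbatch = 1, hashvalue = 3663833849):

#include <stdio.h>
#include <stdint.h>

/* Standalone version of ExecHashGetBucketAndBatch's mask-and-shift logic;
 * the hashtable fields are passed in directly for illustration. */
static void
get_bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets,
                     int log2_nbuckets, uint32_t nbatch,
                     int *bucketno, int *batchno)
{
    if (nbatch > 1)
    {
        /* MOD by masking, DIV by shifting: both sizes are powers of 2 */
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
    }
    else
    {
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = 0;
    }
}

int
main(void)
{
    int bucketno, batchno;

    get_bucket_and_batch(3663833849u, 16384, 14, 1, &bucketno, &batchno);
    printf("bucketno=%d batchno=%d\n", bucketno, batchno); /* 11001, 0 */
    return 0;
}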

ExecHashGetSkewBucket
Returns the index of the skew bucket for a hash value, or INVALID_SKEW_BUCKET_NO if the hash value is not associated with any active skew bucket.

/*
 * ExecHashGetSkewBucket
 *
 *      Returns the index of the skew bucket for this hashvalue,
 *      or INVALID_SKEW_BUCKET_NO if the hashvalue is not
 *      associated with any active skew bucket.
 */
int
ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
{
    int         bucket;

    /*
     * Always return INVALID_SKEW_BUCKET_NO if not doing skew optimization
     * (in particular, this happens after the initial batch is done).
     */
    if (!hashtable->skewEnabled)
        return INVALID_SKEW_BUCKET_NO;

    /*
     * Since skewBucketLen is a power of 2, we can do a modulo by ANDing.
     */
    bucket = hashvalue & (hashtable->skewBucketLen - 1);

    /*
     * While we have not hit a hole in the hashtable and have not hit the
     * desired bucket, we have collided with some other hash value, so try
     * the next bucket location.
     */
    while (hashtable->skewBucket[bucket] != NULL &&
           hashtable->skewBucket[bucket]->hashvalue != hashvalue)
        bucket = (bucket + 1) & (hashtable->skewBucketLen - 1);

    /*
     * Found the desired bucket?
     */
    if (hashtable->skewBucket[bucket] != NULL)
        return bucket;

    /*
     * There must not be any hashtable entry for this hash value.
     */
    return INVALID_SKEW_BUCKET_NO;
}
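The probe loop above is ordinary open addressing with linear probing. A self-contained sketch follows; MySkewBucket is a reduced stand-in for HashSkewBucket, keeping only the hashvalue field:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define INVALID_SKEW_BUCKET_NO (-1)

typedef struct MySkewBucket
{
    uint32_t    hashvalue;
} MySkewBucket;

/* Linear-probe lookup: start at hashvalue MOD len (len is a power of 2, so
 * AND works), then step forward past colliding entries until a hole or a
 * matching hash value is found. */
static int
skew_lookup(MySkewBucket **buckets, int len, uint32_t hashvalue)
{
    int         bucket = hashvalue & (len - 1);

    while (buckets[bucket] != NULL &&
           buckets[bucket]->hashvalue != hashvalue)
        bucket = (bucket + 1) & (len - 1);

    return (buckets[bucket] != NULL) ? bucket : INVALID_SKEW_BUCKET_NO;
}

int
main(void)
{
    MySkewBucket b = {0x2a};
    MySkewBucket *tab[8] = {NULL};

    tab[0x2a & 7] = &b;                         /* lands in slot 2 */
    printf("%d\n", skew_lookup(tab, 8, 0x2a));  /* prints 2 */
    printf("%d\n", skew_lookup(tab, 8, 0x2b));  /* prints -1 */
    return 0;
}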

ExecHashJoinSaveTuple
Saves a tuple to a batch file. The data recorded in the file for each tuple is its hash value, followed by the tuple in MinimalTuple format.

/*
 * ExecHashJoinSaveTuple
 *      save a tuple to a batch file.
 *
 * The data recorded in the file for each tuple is its hash value,
 * then the tuple in MinimalTuple format.
 *
 * Note: it is important always to call this in the regular executor
 * context, not in a shorter-lived context; else the temp file buffers
 * will get messed up.
 */
void
ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                      BufFile **fileptr)
{
    BufFile    *file = *fileptr;
    size_t      written;

    if (file == NULL)
    {
        /* First write to this batch file, so open it. */
        file = BufFileCreateTemp(false);
        *fileptr = file;
    }

    /* write the hash value first, checking the number of bytes written */
    written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
    if (written != sizeof(uint32))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to hash-join temporary file: %m")));

    /* then write the tuple itself */
    written = BufFileWrite(file, (void *) tuple, tuple->t_len);
    if (written != tuple->t_len)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to hash-join temporary file: %m")));
}
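The record format this produces is simply the 4-byte hash value followed by the MinimalTuple, whose own first field (t_len) tells the reader how long it is. Below is a sketch of that layout, with plain stdio standing in for PostgreSQL's BufFile and a fake fixed-size "tuple" whose first 4 bytes hold its length:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Write one batch-file record: hash value, then the tuple bytes. */
static int
save_tuple(FILE *f, uint32_t hashvalue, const void *tuple, uint32_t t_len)
{
    if (fwrite(&hashvalue, sizeof(uint32_t), 1, f) != 1)
        return -1;
    if (fwrite(tuple, t_len, 1, f) != 1)
        return -1;
    return 0;
}

int
main(void)
{
    /* fake "minimal tuple": its first 4 bytes are its total length */
    unsigned char tup[16];
    uint32_t    t_len = sizeof(tup);
    FILE       *f;

    memcpy(tup, &t_len, sizeof(t_len));

    f = tmpfile();
    if (f && save_tuple(f, 3663833849u, tup, t_len) == 0)
        printf("wrote %u + %u bytes\n", (unsigned) sizeof(uint32_t), t_len);
    if (f)
        fclose(f);
    return 0;
}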

ExecFetchSlotMinimalTuple
Fetches the slot's data in minimal physical tuple format.

/* --------------------------------
 *      ExecFetchSlotMinimalTuple
 *          Fetch the slot's minimal physical tuple.
 *
 *      If the given tuple table slot can hold a minimal tuple, indicated by a
 *      non-NULL get_minimal_tuple callback, the function returns the minimal
 *      tuple returned by that callback.  It assumes that the minimal tuple
 *      returned by the callback is "owned" by the slot i.e. the slot is
 *      responsible for freeing the memory consumed by the tuple.  Hence it
 *      sets *shouldFree to false, indicating that the caller should not free
 *      the memory consumed by the minimal tuple.  In this case the returned
 *      minimal tuple should be considered as read-only.
 *
 *      If that callback is not supported, it calls copy_minimal_tuple callback
 *      which is expected to return a copy of minimal tuple representing the
 *      contents of the slot.  In this case *shouldFree is set to true,
 *      indicating the caller that it should free the memory consumed by the
 *      minimal tuple.  In this case the returned minimal tuple may be written
 *      up.
 * --------------------------------
 */
MinimalTuple
ExecFetchSlotMinimalTuple(TupleTableSlot *slot,
                          bool *shouldFree)
{
    /*
     * sanity checks
     */
    Assert(slot != NULL);
    Assert(!TTS_EMPTY(slot));

    if (slot->tts_ops->get_minimal_tuple)
    {
        /* borrowed tuple: read-only, the slot remains responsible for it */
        if (shouldFree)
            *shouldFree = false;
        return slot->tts_ops->get_minimal_tuple(slot);
    }
    else
    {
        /* private copy: the caller must free it */
        if (shouldFree)
            *shouldFree = true;
        return slot->tts_ops->copy_minimal_tuple(slot);
    }
}
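From the caller's side, the contract is an ownership flag: a borrowed tuple is read-only and must not be freed, while a copied one must be freed by the caller. Here is a reduced sketch of that pattern; MySlot and fetch_minimal are illustrative stand-ins, not PostgreSQL APIs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct MySlot
{
    const char *internal;       /* non-NULL if a borrowed form exists */
} MySlot;

/* Return the slot's data, setting *shouldFree per the ownership contract. */
static char *
fetch_minimal(MySlot *slot, int *shouldFree)
{
    if (slot->internal)
    {
        *shouldFree = 0;        /* borrowed: the slot still owns it */
        return (char *) slot->internal;
    }
    *shouldFree = 1;            /* copied: the caller owns it */
    return strdup("copied tuple");
}

int
main(void)
{
    MySlot      s = {NULL};
    int         shouldFree;
    char       *tup = fetch_minimal(&s, &shouldFree);

    printf("%s (shouldFree=%d)\n", tup, shouldFree);
    if (shouldFree)
        free(tup);              /* free only when we were given a copy */
    return 0;
}

3. Tracing with gdb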

The test script is as follows:

testdb=# set enable_nestloop=false;
SET
testdb=# set enable_mergejoin=false;
SET
testdb=# explain verbose select dw.*,grjf.grbh,grjf.xm,grjf.ny,grjf.je
testdb-# from t_dwxx dw,lateral (select gr.grbh,gr.xm,jf.ny,jf.je
testdb(#                         from t_grxx gr inner join t_jfxx jf
testdb(#                         on gr.dwbh = dw.dwbh
testdb(#                         and gr.grbh = jf.grbh) grjf
testdb-# order by dw.dwbh;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Sort  (cost=14828.83..15078.46 rows=99850 width=47)
   Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm, jf.ny, jf.je
   Sort Key: dw.dwbh
   ->  Hash Join  (cost=3176.00..6537.55 rows=99850 width=47)
         Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm, jf.ny, jf.je
         Hash Cond: ((gr.grbh)::text = (jf.grbh)::text)
         ->  Hash Join  (cost=289.00..2277.61 rows=99850 width=32)
               Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm
               Inner Unique: true
               Hash Cond: ((gr.dwbh)::text = (dw.dwbh)::text)
               ->  Seq Scan on public.t_grxx gr  (cost=0.00..1726.00 rows=100000 width=16)
                     Output: gr.dwbh, gr.grbh, gr.xm, gr.xb, gr.nl
               ->  Hash  (cost=164.00..164.00 rows=10000 width=20)
                     Output: dw.dwmc, dw.dwbh, dw.dwdz
                     ->  Seq Scan on public.t_dwxx dw  (cost=0.00..164.00 rows=10000 width=20)
                           Output: dw.dwmc, dw.dwbh, dw.dwdz
         ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20)
               Output: jf.ny, jf.je, jf.grbh
               ->  Seq Scan on public.t_jfxx jf  (cost=0.00..1637.00 rows=100000 width=20)
                     Output: jf.ny, jf.je, jf.grbh
(20 rows)

Start gdb and set breakpoints:

(gdb) b ExecHashJoinOuterGetTuple
Breakpoint 1 at 0x702edc: file nodeHashjoin.c, line 807.
(gdb) b ExecHashGetHashValue
Breakpoint 2 at 0x6ff060: file nodeHash.c, line 1778.
(gdb) b ExecHashGetBucketAndBatch
Breakpoint 3 at 0x6ff1df: file nodeHash.c, line 1880.
(gdb) b ExecHashJoinSaveTuple
Breakpoint 4 at 0x703973: file nodeHashjoin.c, line 1214.
(gdb)

ExecHashGetHashValue
ExecHashGetHashValue -> enter the ExecHashGetHashValue function

(gdb) c
Continuing.

Breakpoint 2, ExecHashGetHashValue (hashtable=0x14acde8, econtext=0x149c3d0, hashkeys=0x14a8e40, outer_tuple=false,
    keep_nulls=false, hashvalue=0x7ffc7eba5c20) at nodeHash.c:1778
1778        uint32      hashkey = 0;

ExecHashGetHashValue -> initialize and switch the memory context

1778        uint32      hashkey = 0;
(gdb) n
1781        int         i = 0;
(gdb)
1788        ResetExprContext(econtext);
(gdb)
1790        oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
(gdb)
1792        if (outer_tuple)

ExecHashGetHashValue -> use the inner hash functions

1792        if (outer_tuple)
(gdb)
1795            hashfunctions = hashtable->inner_hashfunctions;

ExecHashGetHashValue -> inspect the hash key: the dwbh attribute (varattno = 2) of RTE 1 (varnoold = 1, i.e. t_dwxx)

(gdb)
1797        foreach(hk, hashkeys)
(gdb)
1799            ExprState  *keyexpr = (ExprState *) lfirst(hk);
(gdb)
1804            hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
(gdb) p *keyexpr
$1 = {tag = {type = T_ExprState}, flags = 2 '\002', resnull = false, resvalue = 0, resultslot = 0x0,
  steps = 0x14a8a00, evalfunc = 0x6d1a6e <ExecInterpExprStillValid>, expr = 0x1498fc0,
  evalfunc_private = 0x6d1e97 <ExecJustInnerVar>, steps_len = 3, steps_alloc = 16, parent = 0x149b738,
  ext_params = 0x0, innermost_caseval = 0x0, innermost_casenull = 0x0, innermost_domainval = 0x0,
  innermost_domainnull = 0x0}
(gdb) p *(RelabelType *) keyexpr->expr
$3 = {xpr = {type = T_RelabelType}, arg = 0x1499018, resulttype = 25, resulttypmod = -1, resultcollid = 100,
  relabelformat = COERCE_IMPLICIT_CAST, location = -1}
(gdb) p *((RelabelType *) keyexpr->expr)->arg
$4 = {type = T_Var}
(gdb) p *(Var *) ((RelabelType *) keyexpr->expr)->arg
$5 = {xpr = {type = T_Var}, varno = 65000, varattno = 2, vartype = 1043, vartypmod = 24, varcollid = 100,
  varlevelsup = 0, varnoold = 1, varoattno = 2, location = 218}
(gdb)

ExecHashGetHashValue -> evaluate the key expression to get the join attribute value

(gdb) n
1809            keyval = ExecEvalExpr(keyexpr, econtext, &isNull);
(gdb)
1824            if (isNull)
(gdb) p hashkey
$6 = 0
(gdb) p keyval
$7 = 140460362257270
(gdb)

ExecHashGetHashValue -> the returned value is not NULL

(gdb) p isNull
$8 = false
(gdb) n
1838                hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], keyval));

ExecHashGetHashValue -> compute the hash value

(gdb) n
1839                hashkey ^= hkey;
(gdb) p hkey
$9 = 3663833849
(gdb) p hashkey
$10 = 0
(gdb) n
1842            i++;
(gdb) p hashkey
$11 = 3663833849
(gdb)

ExecHashGetHashValue -> return the computed result

(gdb) n
1797        foreach(hk, hashkeys)
(gdb)
1845        MemoryContextSwitchTo(oldContext);
(gdb)
1847        *hashvalue = hashkey;
(gdb)
1848        return true;
(gdb)
1849    }

ExecHashGetBucketAndBatch
ExecHashGetBucketAndBatch -> enter ExecHashGetBucketAndBatch

(gdb) c
Continuing.

Breakpoint 3, ExecHashGetBucketAndBatch (hashtable=0x14acde8, hashvalue=3663833849, bucketno=0x7ffc7eba5bdc,
    batchno=0x7ffc7eba5bd8) at nodeHash.c:1880
1880        uint32      nbuckets = (uint32) hashtable->nbuckets;

ExecHashGetBucketAndBatch -> get the bucket count and batch count

1880        uint32      nbuckets = (uint32) hashtable->nbuckets;
(gdb) n
1881        uint32      nbatch = (uint32) hashtable->nbatch;
(gdb)
1883        if (nbatch > 1)
(gdb) p nbuckets
$12 = 16384
(gdb) p nbatch
$13 = 1
(gdb)

ExecHashGetBucketAndBatch -> compute the bucket number and batch number (there is only one batch, so batchno is set to 0)

(gdb) n
1891            *bucketno = hashvalue & (nbuckets - 1);
(gdb)
1892            *batchno = 0;
(gdb)
1894    }
(gdb) p bucketno
$14 = (int *) 0x7ffc7eba5bdc
(gdb) p *bucketno
$15 = 11001
(gdb)
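The value gdb prints can be checked by hand: with nbuckets = 16384 the mask is 16383, and 3663833849 & 16383 = 11001.

#include <stdio.h>

int
main(void)
{
    /* reproduces the *bucketno value printed by gdb above */
    printf("%u\n", 3663833849u & 16383u);   /* prints 11001 */
    return 0;
}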

ExecHashJoinOuterGetTuple
ExecHashJoinOuterGetTuple -> enter the ExecHashJoinOuterGetTuple function

(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000702edc in ExecHashJoinOuterGetTuple at nodeHashjoin.c:807
2       breakpoint     keep y   0x00000000006ff060 in ExecHashGetHashValue at nodeHash.c:1778
        breakpoint already hit 4 times
3       breakpoint     keep y   0x00000000006ff1df in ExecHashGetBucketAndBatch at nodeHash.c:1880
        breakpoint already hit 4 times
4       breakpoint     keep y   0x0000000000703973 in ExecHashJoinSaveTuple at nodeHashjoin.c:1214
(gdb) del 2
(gdb) del 3
(gdb) c
Continuing.

Breakpoint 1, ExecHashJoinOuterGetTuple (outerNode=0x149ba10, hjstate=0x149b738, hashvalue=0x7ffc7eba5ccc)
    at nodeHashjoin.c:807
807     HashJoinTable hashtable = hjstate->hj_HashTable;
(gdb)

ExecHashJoinOuterGetTuple -> examine the input parameters:
outerNode: the outer relation, produced here by a sequential scan (on t_jfxx)
hjstate: the Hash Join execution state
hashvalue: the hash value

(gdb) p *outerNode
$16 = {type = T_SeqScanState, plan = 0x1494d10, state = 0x149b0f8, ExecProcNode = 0x71578d <ExecSeqScan>,
  ExecProcNodeReal = 0x71578d <ExecSeqScan>, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0,
  qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0,
  ps_ResultTupleSlot = 0x149c178, ps_ExprContext = 0x149bb28, ps_ProjInfo = 0x0, scandesc = 0x7fbfa69a8308}
(gdb) p *hjstate
$17 = {js = {ps = {type = T_HashJoinState, plan = 0x1496d18, state = 0x149b0f8, ExecProcNode = 0x70291d <ExecHashJoin>,
      ExecProcNodeReal = 0x70291d <ExecHashJoin>, instrument = 0x0, worker_instrument = 0x0,
      worker_jit_instrument = 0x0, qual = 0x0, lefttree = 0x149ba10, righttree = 0x149c2b8, initPlan = 0x0,
      subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x14a7498, ps_ExprContext = 0x149b950,
      ps_ProjInfo = 0x149cef0, scandesc = 0x0}, jointype = JOIN_INNER, single_match = true, joinqual = 0x0},
  hashclauses = 0x14a7b30, hj_OuterHashKeys = 0x14a8930, hj_InnerHashKeys = 0x14a8e40, hj_HashOperators = 0x14a8ea0,
  hj_HashTable = 0x14acde8, hj_CurHashValue = 0, hj_CurBucketNo = 0, hj_CurSkewBucketNo = -1, hj_CurTuple = 0x0,
  hj_OuterTupleSlot = 0x14a79f0, hj_HashTupleSlot = 0x149cc18, hj_NullOuterTupleSlot = 0x0,
  hj_NullInnerTupleSlot = 0x0, hj_FirstOuterTupleSlot = 0x149bbe8, hj_JoinState = 2, hj_MatchedOuter = false,
  hj_OuterNotEmpty = false}
(gdb) p *hashvalue
$18 = 32703
(gdb)

ExecHashJoinOuterGetTuple -> there is only one batch; the batch number is 0

(gdb) n
808     int         curbatch = hashtable->curbatch;
(gdb)
811     if (curbatch == 0)          /* if it is the first pass */
(gdb) p curbatch
$20 = 0

ExecHashJoinOuterGetTuple -> get the first outer tuple slot (not NULL), then reset hjstate->hj_FirstOuterTupleSlot to NULL

(gdb) n
817         slot = hjstate->hj_FirstOuterTupleSlot;
(gdb)
818         if (!TupIsNull(slot))
(gdb) p *slot
$21 = {type = T_TupleTableSlot, tts_isempty = false, tts_shouldFree = false, tts_shouldFreeMin = false,
  tts_slow = false, tts_tuple = 0x14ac200, tts_tupleDescriptor = 0x7fbfa69a8308, tts_mcxt = 0x149afe0,
  tts_buffer = 345, tts_nvalid = 0, tts_values = 0x149bc48, tts_isnull = 0x149bc70, tts_mintuple = 0x0,
  tts_minhdr = {t_len = 0, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0,
    t_data = 0x0}, tts_off = 0, tts_fixedTupleDescriptor = true}
(gdb)
(gdb) n
819             hjstate->hj_FirstOuterTupleSlot = NULL;
(gdb)

ExecHashJoinOuterGetTuple -> loop to fetch tuples until a matching slot is found

(gdb)
823         while (!TupIsNull(slot))
(gdb) n
828             ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
(gdb)

ExecHashJoinOuterGetTuple -> hash value computed successfully; return the slot

(gdb) n
830             econtext->ecxt_outertuple = slot;
(gdb)
834                                      HJ_FILL_OUTER(hjstate),
(gdb)
831             if (ExecHashGetHashValue(hashtable, econtext,
(gdb)
838                 hjstate->hj_OuterNotEmpty = true;
(gdb)
840                 return slot;
(gdb) p *slot
$22 = {type = T_TupleTableSlot, tts_isempty = false, tts_shouldFree = false, tts_shouldFreeMin = false,
  tts_slow = true, tts_tuple = 0x14ac200, tts_tupleDescriptor = 0x7fbfa69a8308, tts_mcxt = 0x149afe0,
  tts_buffer = 345, tts_nvalid = 1, tts_values = 0x149bc48, tts_isnull = 0x149bc70, tts_mintuple = 0x0,
  tts_minhdr = {t_len = 0, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0,
    t_data = 0x0}, tts_off = 2, tts_fixedTupleDescriptor = true}
(gdb)
