本节介绍了PostgreSQL含有NOT IN查询语句实现ExecMaterial函数中部分依赖的函数。

一、数据结构

SubPlanState
子计划运行期状态

/* ---------------- * SubPlanState node * ---------------- */typedef struct SubPlanState{ NodeTag type; SubPlan *subplan; /* expression plan node */ struct PlanState *planstate; /* subselect plan's state tree */ struct PlanState *parent; /* parent plan node's state tree */ ExprState *testexpr; /* 组合表达式状态;state of combining expression */ List *args; /* 参数表达式状态;states of argument expression(s) */ HeapTuple curTuple; /* subplan最近的元组;copy of most recent tuple from subplan */ Datum curArray; /* most recent array from ARRAY() subplan */ /* these are used when hashing the subselect's output: */ TupleDesc descRight; /* 投影后的子查询描述符;subselect desc after projection */ ProjectionInfo *projLeft; /* for projecting lefthand exprs */ ProjectionInfo *projRight; /* for projecting subselect output */ TupleHashTable hashtable; /* hash table for no-nulls subselect rows */ TupleHashTable hashnulls; /* hash table for rows with null(s) */ bool havehashrows; /* true if hashtable is not empty */ bool havenullrows; /* true if hashnulls is not empty */ MemoryContext hashtablecxt; /* memory context containing hash tables */ MemoryContext hashtempcxt; /* temp memory context for hash tables */ ExprContext *innerecontext; /* econtext for computing inner tuples */ AttrNumber *keyColIdx; /* control data for hash tables */ Oid *tab_eq_funcoids; /* equality func oids for table * datatype(s) */ Oid *tab_collations; /* collations for hash and comparison */ FmgrInfo *tab_hash_funcs; /* hash functions for table datatype(s) */ FmgrInfo *tab_eq_funcs; /* equality functions for table datatype(s) */ FmgrInfo *lhs_hash_funcs; /* hash functions for lefthand datatype(s) */ FmgrInfo *cur_eq_funcs; /* equality functions for LHS vs. table */ ExprState *cur_eq_comp; /* equality comparator for LHS vs. table */} SubPlanState;

SubPlan
子查询计划

/* * SubPlan - executable expression node for a subplan (sub-SELECT) * * The planner replaces SubLink nodes in expression trees with SubPlan * nodes after it has finished planning the subquery. SubPlan references * a sub-plantree stored in the subplans list of the toplevel PlannedStmt. * (We avoid a direct link to make it easier to copy expression trees * without causing multiple processing of the subplan.) * 查询规划器在完成子查询的规划后使用SubPlan节点替换表达式树中的SubLink节点。 * SubPlan引用了存储在高层PlannedStmt中的subplans链表中的sub-plantree。 * (避免使用直接链接,从而使得拷贝表达式树相对比较简单) * * In an ordinary subplan, testexpr points to an executable expression * (OpExpr, an AND/OR tree of OpExprs, or RowCompareExpr) for the combining * operator(s); the left-hand arguments are the original lefthand expressions, * and the right-hand arguments are PARAM_EXEC Param nodes representing the * outputs of the sub-select. (NOTE: runtime coercion functions may be * inserted as well.) This is just the same expression tree as testexpr in * the original SubLink node, but the PARAM_SUBLINK nodes are replaced by * suitably numbered PARAM_EXEC nodes. * 常规情况下,testexpr指向用于组合操作的可执行表达式(OpExpr、OpExprs的AND/OR树或者RowCompareExpr); * 左参数是原始的左表达式,右参数是PARAM_EXEC参数节点用以表示子查询的输出。 * 与原始SubLink节点的testexpr具有相同的表达式树,但PARAM_SUBLINK节点则使用合适的已编号PARAM_EXEC节点替代。 * * If the sub-select becomes an initplan rather than a subplan, the executable * expression is part of the outer plan's expression tree (and the SubPlan * node itself is not, but rather is found in the outer plan's initPlan * list). In this case testexpr is NULL to avoid duplication. * 如果子查询成了initplan而不是subplan,可执行的表达式是外层plan表达式树的一部分。 * 这种情况下,testexpr为NULL以避免重复。 * * The planner also derives lists of the values that need to be passed into * and out of the subplan. Input values are represented as a list "args" of * expressions to be evaluated in the outer-query context (currently these * args are always just Vars, but in principle they could be any expression). * The values are assigned to the global PARAM_EXEC params indexed by parParam * (the parParam and args lists must have the same ordering). setParam is a * list of the PARAM_EXEC params that are computed by the sub-select, if it * is an initplan; they are listed in order by sub-select output column * position. (parParam and setParam are integer Lists, not Bitmapsets, * because their ordering is significant.) * 规划器还派生了需要传入和传出子计划的值的链表。 * 输入值标识位表达式的“args”链表,在外层查询上下文中进行解析。 * (这些args通常是Vars,但原则上它们可以是任意表达式) * 这些值以parParam为索引给全局PARAM_EXEC参数赋值。 * setParam是PARAM_EXEC参数链表,通过子查询(如为initplan)计算所得。 * 它们按子查询输出列的位置进行排序组织为链表形式。 * (parParam和setParam是整型链表,而不是Bitmapsets链表) * * Also, the planner computes startup and per-call costs for use of the * SubPlan. Note that these include the cost of the subquery proper, * evaluation of the testexpr if any, and any hashtable management overhead. * 同时,规划器计算SubPlan启动和每次调用的成本。注意:包括子查询正常解析testexpr的成本以及哈希表管理成本。 */typedef struct SubPlan{ Expr xpr;//表达式 /* Fields copied from original SubLink: */ //从SubLink中拷贝而来 SubLinkType subLinkType; /* see above */ /* The combining operators, transformed to an executable expression: */ //组合操作符,转换为可执行的表达式 Node *testexpr; /* OpExpr or RowCompareExpr expression tree */ List *paramIds; /* 参数IDs;IDs of Params embedded in the above */ /* Identification of the Plan tree to use: */ //Plan tree标识 int plan_id; /* Index (from 1) in PlannedStmt.subplans */ /* Identification of the SubPlan for EXPLAIN and debugging purposes: */ //EXPLAIN和debug目的的SubPlan标识 char *plan_name; /* A name assigned during planning */ /* Extra data useful for determining subplan's output type: */ //用于确定subplan输出类型的额外信息 Oid firstColType; /* subplan结果的第一个列类型;Type of first column of subplan result */ int32 firstColTypmod; /* 第一列的Typmod;Typmod of first column of subplan result */ Oid firstColCollation; /* 第一列的Collation;Collation of first column of subplan * result */ /* Information about execution strategy: */ //执行阶段的相关信息 bool useHashTable; /* 是否使用哈希表存储子查询输出;true to store subselect output in a hash * table (implies we are doing "IN") */ bool unknownEqFalse; /* 如OK为T,如为未知则为F;快速处理null值;true if it's okay to return FALSE when the * spec result is UNKNOWN; this allows much * simpler handling of null values */ bool parallel_safe; /* 是否并行安全?is the subplan parallel-safe? */ /* Note: parallel_safe does not consider contents of testexpr or args */ /* Information for passing params into and out of the subselect: */ //用于给子查询传入和传出参数的信息 /* setParam and parParam are lists of integers (param IDs) */ //setParam和parParam是整型链表(param IDs) List *setParam; /* initplan subqueries have to set these * Params for parent plan */ List *parParam; /* indices of input Params from parent plan */ List *args; /* 以parParam值进行传递的表达式;exprs to pass as parParam values */ /* Estimated execution costs: */ //估算执行成本 Cost startup_cost; /* one-time setup cost */ Cost per_call_cost; /* cost for each subplan evaluation */} SubPlan;

SubLinkType
SubLink类型

/* * SubLink * * A SubLink represents a subselect appearing in an expression, and in some * cases also the combining operator(s) just above it. The subLinkType * indicates the form of the expression represented: * EXISTS_SUBLINK EXISTS(SELECT ...) * ALL_SUBLINK (lefthand) op ALL (SELECT ...) * ANY_SUBLINK (lefthand) op ANY (SELECT ...) * ROWCOMPARE_SUBLINK (lefthand) op (SELECT ...) * EXPR_SUBLINK (SELECT with single targetlist item ...) * MULTIEXPR_SUBLINK (SELECT with multiple targetlist items ...) * ARRAY_SUBLINK ARRAY(SELECT with single targetlist item ...) * CTE_SUBLINK WITH query (never actually part of an expression) * 我们使用SubLink表示在表达式中出现的子查询,在某些情况下组合操作符会出现在SubLink之上。 * subLinkType表示表达式的形式: * EXISTS_SUBLINK EXISTS(SELECT ...) * ALL_SUBLINK (lefthand) op ALL (SELECT ...) * ANY_SUBLINK (lefthand) op ANY (SELECT ...) * ROWCOMPARE_SUBLINK (lefthand) op (SELECT ...) * EXPR_SUBLINK (SELECT with single targetlist item ...) * MULTIEXPR_SUBLINK (SELECT with multiple targetlist items ...) * ARRAY_SUBLINK ARRAY(SELECT with single targetlist item ...) * CTE_SUBLINK WITH query (never actually part of an expression) * * For ALL, ANY, and ROWCOMPARE, the lefthand is a list of expressions of the * same length as the subselect's targetlist. ROWCOMPARE will *always* have * a list with more than one entry; if the subselect has just one target * then the parser will create an EXPR_SUBLINK instead (and any operator * above the subselect will be represented separately). * ROWCOMPARE, EXPR, and MULTIEXPR require the subselect to deliver at most * one row (if it returns no rows, the result is NULL). * ALL, ANY, and ROWCOMPARE require the combining operators to deliver boolean * results. ALL and ANY combine the per-row results using AND and OR * semantics respectively. * ARRAY requires just one target column, and creates an array of the target * column's type using any number of rows resulting from the subselect. * 对于ALL,ANY和ROWCOMPARE,左操作符是与子查询目标链表长度一致的表达式链表。 * ROWCOMPARE通常有超过一个条目的链表;如果子查询刚好只有一个目标列,那么解析器会创建EXPR_SUBLINK * (同时所有在子查询之上的操作符会单独表示) * ROWCOMPARE, EXPR, 和MULTIEXPR要求子查询至少输出一行(如返回0行,则结果为NULL)。 * ALL,ANY和ROWCOMPARE要求组合操作符输出布尔型结果。 * ALL/ANY使用AND/OR语义来组合每一行的结果。 * * SubLink is classed as an Expr node, but it is not actually executable; * it must be replaced in the expression tree by a SubPlan node during * planning. * SubLink归类为Expr节点,但实际上并不是可执行的,必须在计划阶段通过SubPlan替代。 * * NOTE: in the raw output of gram.y, testexpr contains just the raw form * of the lefthand expression (if any), and operName is the String name of * the combining operator. Also, subselect is a raw parsetree. During parse * analysis, the parser transforms testexpr into a complete boolean expression * that compares the lefthand value(s) to PARAM_SUBLINK nodes representing the * output columns of the subselect. And subselect is transformed to a Query. * This is the representation seen in saved rules and in the rewriter. * 注意:在gram.y的裸输出中,testexpr只包含左表达式的裸形式,operName是组合操作符的字符串名称。 * 同时,子查询是裸parsetree。在解析分析期间, * 解析器转换testexpr为完整的布尔表达式用于比较左操作符值与PARAM_SUBLINK节点所代表的子查询输出列值。 * 子查询会转换为Query结构体。 * 在已存储的规则和重写时可见的表示形式。 * * In EXISTS, EXPR, MULTIEXPR, and ARRAY SubLinks, testexpr and operName * are unused and are always null. * 在EXISTS/EXPR/MULTEXPR/ARRAY SubLinks中,testexpr和operName不再使用通常是NULL值。 * * subLinkId is currently used only for MULTIEXPR SubLinks, and is zero in * other SubLinks. This number identifies different multiple-assignment * subqueries within an UPDATE statement's SET list. It is unique only * within a particular targetlist. The output column(s) of the MULTIEXPR * are referenced by PARAM_MULTIEXPR Params appearing elsewhere in the tlist. * subLinkId当前只用于MULTIEXPR,在其他SubLinks中取值为0. * 该数字标识了在UPDATE语句SET链表中不同的多个赋值子查询。 * 只有在特定的targetlist内是唯一的。 * 出现在tlist其他地方的PARAM_MULTIEXPR参数依赖于MULTIEXPR的输出列。 * * The CTE_SUBLINK case never occurs in actual SubLink nodes, but it is used * in SubPlans generated for WITH subqueries. * CTE_SUBLINK不会出现在实际的SubLink节点中,但用于WITH子查询所产生的SubPlans中。 */typedef enum SubLinkType{ EXISTS_SUBLINK, ALL_SUBLINK, ANY_SUBLINK, ROWCOMPARE_SUBLINK, EXPR_SUBLINK, MULTIEXPR_SUBLINK, ARRAY_SUBLINK, CTE_SUBLINK /* 仅用于SubPlans中;for SubPlans only */} SubLinkType;

SubLink
SubLink结构体

typedef struct SubLink{ Expr xpr; SubLinkType subLinkType; /* see above */ int subLinkId; /* ID (1..n); 0 if not MULTIEXPR */ Node *testexpr; /* outer-query test for ALL/ANY/ROWCOMPARE */ List *operName; /* originally specified operator name */ Node *subselect; /* subselect as Query* or raw parsetree */ int location; /* token location, or -1 if unknown */} SubLink;

MaterialState
Material状态

/* ---------------- * MaterialState information * * materialize nodes are used to materialize the results * of a subplan into a temporary file. * materialize节点用于物化subplan的结果为临时文件。 * * ss.ss_ScanTupleSlot refers to output of underlying plan. * ss.ss_ScanTupleSlot指向underlyling plan的输出(subplan) * ---------------- */typedef struct MaterialState{ ScanState ss; /* its first field is NodeTag */ int eflags; /* 传递给tuplestore的capability标记;capability flags to pass to tuplestore */ bool eof_underlying; /* 已经到达underlying plan的末尾?reached end of underlying plan? */ Tuplestorestate *tuplestorestate;} MaterialState;二、源码解读

ExecMaterial
执行物化操作。

/* ---------------------------------------------------------------- * ExecMaterial * * As long as we are at the end of the data collected in the tuplestore, * we collect one new row from the subplan on each call, and stash it * aside in the tuplestore before returning it. The tuplestore is * only read if we are asked to scan backwards, rescan, or mark/restore. * 只要在tuplestore中数据收集结束时,就会在每次调用时从subplan中收集一条新行, * 并在返回之前将其保存在tuplestore中。 * 只要在往后扫描、重新扫描或标记/恢复时tuplestore才会读取。 * * ---------------------------------------------------------------- */static TupleTableSlot * /* 从subplan中返回的结果;result tuple from subplan */ExecMaterial(PlanState *pstate){ MaterialState *node = castNode(MaterialState, pstate);//物化节点 EState *estate;//运行期状态 ScanDirection dir;//扫描方向 bool forward;//是否往前扫描 Tuplestorestate *tuplestorestate;//Tuplestorestate结构体指针 bool eof_tuplestore;//是否完成? TupleTableSlot *slot;//存储元组的slot CHECK_FOR_INTERRUPTS(); /* * get state info from node * 从物化节点中获取相关信息 */ estate = node->ss.ps.state; dir = estate->es_direction;//方向 forward = ScanDirectionIsForward(dir);//是否往前扫描 tuplestorestate = node->tuplestorestate; /* * If first time through, and we need a tuplestore, initialize it. * 第一次,需要tuplestore并初始化 */ if (tuplestorestate == NULL && node->eflags != 0) { tuplestorestate = tuplestore_begin_heap(true, false, work_mem); tuplestore_set_eflags(tuplestorestate, node->eflags); if (node->eflags & EXEC_FLAG_MARK) { /* * Allocate a second read pointer to serve as the mark. We know it * must have index 1, so needn't store that. * 分配用于mark的读指针 */ int ptrno PG_USED_FOR_ASSERTS_ONLY; ptrno = tuplestore_alloc_read_pointer(tuplestorestate, node->eflags); Assert(ptrno == 1); } node->tuplestorestate = tuplestorestate; } /* * If we are not at the end of the tuplestore, or are going backwards, try * to fetch a tuple from tuplestore. * 如果不在tuplestore的末尾或者正在往后扫描,尝试从tuplestore中提取一个元组 */ eof_tuplestore = (tuplestorestate == NULL) || tuplestore_ateof(tuplestorestate); if (!forward && eof_tuplestore) { if (!node->eof_underlying) { /* * When reversing direction at tuplestore EOF, the first * gettupleslot call will fetch the last-added tuple; but we want * to return the one before that, if possible. So do an extra * fetch. * 在EOF处反转方向,第一次的gettupleslot调用会提取最后添加的元组; * 但如可能,希望返回在此之前的元组,执行额外的提取操作。 */ if (!tuplestore_advance(tuplestorestate, forward)) return NULL; /* the tuplestore must be empty */ } eof_tuplestore = false; } /* * If we can fetch another tuple from the tuplestore, return it. * 如能从tuplestore中提取另外一个tuple,返回 */ slot = node->ss.ps.ps_ResultTupleSlot; if (!eof_tuplestore) { if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot)) return slot; if (forward) eof_tuplestore = true; } /* * If necessary, try to fetch another row from the subplan. * 如需要(tuplestore末尾),尝试从subplan中提取另外一行 * * Note: the eof_underlying state variable exists to short-circuit further * subplan calls. It's not optional, unfortunately, because some plan * node types are not robust about being called again when they've already * returned NULL. */ if (eof_tuplestore && !node->eof_underlying) { PlanState *outerNode; TupleTableSlot *outerslot; /* * We can only get here with forward==true, so no need to worry about * which direction the subplan will go. */ outerNode = outerPlanState(node); outerslot = ExecProcNode(outerNode); if (TupIsNull(outerslot)) { node->eof_underlying = true; return NULL; } /* * Append a copy of the returned tuple to tuplestore. NOTE: because * the tuplestore is certainly in EOF state, its read position will * move forward over the added tuple. This is what we want. * 追加返回的元组到tuplestore中。 * 注意:因为tuplestore当前处于EOF状态,读取的位置会前移至已添加的tuple前面,这是我们希望看到的。 */ if (tuplestorestate) tuplestore_puttupleslot(tuplestorestate, outerslot); ExecCopySlot(slot, outerslot); return slot; } /* * Nothing left ... */ return ExecClearTuple(slot);}

tuplestore_begin_heap
初始化tuplestore

/* * tuplestore_begin_heap * * Create a new tuplestore; other types of tuple stores (other than * "heap" tuple stores, for heap tuples) are possible, but not presently * implemented. * 创建新的tuplestore:目前仅实现了heap tuples。 * * randomAccess: if true, both forward and backward accesses to the * tuple store are allowed. * randomAccess : 如为T,支持往前和往后访问。 * * interXact: if true, the files used for on-disk storage persist beyond the * end of the current transaction. NOTE: It's the caller's responsibility to * create such a tuplestore in a memory context and resource owner that will * also survive transaction boundaries, and to ensure the tuplestore is closed * when it's no longer wanted. * interXact : 如为T,磁盘上的存储文件在当前事务结束后也会一直保持。 * 注意:调用者有责任在事务边界内存活的内存上下文和资源拥有者中创建tuplestore并确保不再使用时销毁tuplestore。 * * maxKBytes: how much data to store in memory (any data beyond this * amount is paged to disk). When in doubt, use work_mem. * maxKBytes:有多少数据需要存储到内存中(超长此大小的会分页到磁盘上)。 * 如存在问题,则使用work_mem。 */Tuplestorestate *tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes){ Tuplestorestate *state; int eflags; /* * This interpretation of the meaning of randomAccess is compatible with * the pre-8.3 behavior of tuplestores. */ eflags = randomAccess ? (EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) : (EXEC_FLAG_REWIND); state = tuplestore_begin_common(eflags, interXact, maxKBytes); state->copytup = copytup_heap; state->writetup = writetup_heap; state->readtup = readtup_heap; return state;}/* * tuplestore_begin_xxx * * Initialize for a tuple store operation. * 初始化tuplestore */static Tuplestorestate *tuplestore_begin_common(int eflags, bool interXact, int maxKBytes){ Tuplestorestate *state; state = (Tuplestorestate *) palloc0(sizeof(Tuplestorestate)); state->status = TSS_INMEM; state->eflags = eflags; state->interXact = interXact; state->truncated = false; state->allowedMem = maxKBytes * 1024L; state->availMem = state->allowedMem; state->myfile = NULL; state->context = CurrentMemoryContext; state->resowner = CurrentResourceOwner; state->memtupdeleted = 0; state->memtupcount = 0; state->tuples = 0; /* * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD; * see comments in grow_memtuples(). */ state->memtupsize = Max(16384 / sizeof(void *), ALLOCSET_SEPARATE_THRESHOLD / sizeof(void *) + 1); state->growmemtuples = true; state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *)); USEMEM(state, GetMemoryChunkSpace(state->memtuples)); state->activeptr = 0; state->readptrcount = 1; state->readptrsize = 8; /* arbitrary */ state->readptrs = (TSReadPointer *) palloc(state->readptrsize * sizeof(TSReadPointer)); state->readptrs[0].eflags = eflags; state->readptrs[0].eof_reached = false; state->readptrs[0].current = 0; return state;}

tuplestore_advance
从tuplestore前进一行

/* * tuplestore_advance - exported function to adjust position without fetching * * We could optimize this case to avoid palloc/pfree overhead, but for the * moment it doesn't seem worthwhile. */booltuplestore_advance(Tuplestorestate *state, bool forward){ void *tuple; bool should_free; tuple = tuplestore_gettuple(state, forward, &should_free); if (tuple) { if (should_free) pfree(tuple); return true; } else { return false; }}

tuplestore_gettupleslot
获取slot

/* * tuplestore_gettupleslot - exported function to fetch a MinimalTuple * 提取MinimalTuple * * If successful, put tuple in slot and return true; else, clear the slot * and return false. * 如成功,则把元组塞进slot中并返回T,否则清空slot返回F * * If copy is true, the slot receives a copied tuple (allocated in current * memory context) that will stay valid regardless of future manipulations of * the tuplestore's state. If copy is false, the slot may just receive a * pointer to a tuple held within the tuplestore. The latter is more * efficient but the slot contents may be corrupted if additional writes to * the tuplestore occur. (If using tuplestore_trim, see comments therein.) * 如copy为T,则slot会接收拷贝之后的元组,独立于tuplestore的状态. * 如copy为F,则slot可能接收到tuplestore中的元组指针. */booltuplestore_gettupleslot(Tuplestorestate *state, bool forward, bool copy, TupleTableSlot *slot){ MinimalTuple tuple; bool should_free; tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free); if (tuple) { if (copy && !should_free) { tuple = heap_copy_minimal_tuple(tuple); should_free = true; } ExecStoreMinimalTuple(tuple, slot, should_free); return true; } else { ExecClearTuple(slot); return false; }}

tuplestore_gettuple
返回下一个元组

/* * Fetch the next tuple in either forward or back direction. * Returns NULL if no more tuples. If should_free is set, the * caller must pfree the returned tuple when done with it. * 往前/后返回下一个元组。 * 如无更多元组,返回NULL。如should_free有值,调用者必须在处理完毕后释放返回的元组 * * Backward scan is only allowed if randomAccess was set true or * EXEC_FLAG_BACKWARD was specified to tuplestore_set_eflags(). * 在randomAccess设置为T或者指定EXEC_FLAG_BACKWARD时才允许。 */static void *tuplestore_gettuple(Tuplestorestate *state, bool forward, bool *should_free){ TSReadPointer *readptr = &state->readptrs[state->activeptr];//读取指针 unsigned int tuplen; void *tup; Assert(forward || (readptr->eflags & EXEC_FLAG_BACKWARD)); switch (state->status) { case TSS_INMEM://内存中 *should_free = false; if (forward) { if (readptr->eof_reached) return NULL; if (readptr->current < state->memtupcount) { /* We have another tuple, so return it */ return state->memtuples[readptr->current++]; } readptr->eof_reached = true; return NULL; } else { /* * if all tuples are fetched already then we return last * tuple, else tuple before last returned. */ if (readptr->eof_reached) { readptr->current = state->memtupcount; readptr->eof_reached = false; } else { if (readptr->current <= state->memtupdeleted) { Assert(!state->truncated); return NULL; } readptr->current--; /* last returned tuple */ } if (readptr->current <= state->memtupdeleted) { Assert(!state->truncated); return NULL; } return state->memtuples[readptr->current - 1]; } break; case TSS_WRITEFILE://写文件 /* Skip state change if we'll just return NULL */ //如只需要返回NULL则跳过状态变换 if (readptr->eof_reached && forward) return NULL; /* * Switch from writing to reading. * 从写切换至读 */ BufFileTell(state->myfile, &state->writepos_file, &state->writepos_offset); if (!readptr->eof_reached) if (BufFileSeek(state->myfile, readptr->file, readptr->offset, SEEK_SET) != 0) ereport(ERROR, (errcode_for_file_access(), errmsg("could not seek in tuplestore temporary file: %m"))); state->status = TSS_READFILE; /* FALLTHROUGH */ //进入读文件状态的处理逻辑 case TSS_READFILE: *should_free = true; if (forward) { //往前读 if ((tuplen = getlen(state, true)) != 0) { tup = READTUP(state, tuplen); return tup; } else { readptr->eof_reached = true; return NULL; } } /* * Backward. * 往后读 * * if all tuples are fetched already then we return last tuple, * else tuple before last returned. * 如果所有元组时已提取,则返回最后一个元组,否则返回先前最后返回的元组 * * Back up to fetch previously-returned tuple's ending length * word. If seek fails, assume we are at start of file. * 往回向上提取先前已返回的元组结束长度字,如检索失败,假定处于文件的开始位置. */ if (BufFileSeek(state->myfile, 0, -(long) sizeof(unsigned int), SEEK_CUR) != 0) { /* even a failed backwards fetch gets you out of eof state */ readptr->eof_reached = false; Assert(!state->truncated); return NULL; } tuplen = getlen(state, false); if (readptr->eof_reached) { readptr->eof_reached = false; /* We will return the tuple returned before returning NULL */ //在返回NULL前返回先前已返回的元组 } else { /* * Back up to get ending length word of tuple before it. * 获取结束长度字 */ if (BufFileSeek(state->myfile, 0, -(long) (tuplen + 2 * sizeof(unsigned int)), SEEK_CUR) != 0) { /* * If that fails, presumably the prev tuple is the first * in the file. Back up so that it becomes next to read * in forward direction (not obviously right, but that is * what in-memory case does). */ if (BufFileSeek(state->myfile, 0, -(long) (tuplen + sizeof(unsigned int)), SEEK_CUR) != 0) ereport(ERROR, (errcode_for_file_access(), errmsg("could not seek in tuplestore temporary file: %m"))); Assert(!state->truncated); return NULL; } tuplen = getlen(state, false); } /* * Now we have the length of the prior tuple, back up and read it. * Note: READTUP expects we are positioned after the initial * length word of the tuple, so back up to that point. * 已获得优先元组的长度,读取之. */ if (BufFileSeek(state->myfile, 0, -(long) tuplen, SEEK_CUR) != 0) ereport(ERROR, (errcode_for_file_access(), errmsg("could not seek in tuplestore temporary file: %m"))); tup = READTUP(state, tuplen); return tup; default: elog(ERROR, "invalid tuplestore state"); return NULL; /* keep compiler quiet */ }}三、跟踪分析

执行SQL:

[pg12@localhost ~]$ psql -d testdbTiming is on.Expanded display is used automatically.psql (12.0)Type "help" for help.[local]:5432 pg12@testdb=# [local]:5432 pg12@testdb=# select * from tbl; id | value ----+------- 1 | 2(1 row)Time: 2.678 ms[local]:5432 pg12@testdb=# select count(*) from t_big_null; count ---------- 10000001(1 row)Time: 679.972 ms[local]:5432 pg12@testdb=# analyze tbl;ANALYZETime: 64.442 ms[local]:5432 pg12@testdb=# analyze t_big_null;ANALYZETime: 434.702 ms[local]:5432 pg12@testdb=# [local]:5432 pg12@testdb=# select pg_backend_pid(); pg_backend_pid ---------------- 18758(1 row)Time: 1.990 ms[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b);

启动gdb跟踪

(gdb) b ExecMaterialBreakpoint 1 at 0x720edb: file nodeMaterial.c, line 41.(gdb) cContinuing.Breakpoint 1, ExecMaterial (pstate=0x1230128) at nodeMaterial.c:4141 MaterialState *node = castNode(MaterialState, pstate);(gdb)

单步调试

(gdb) n49 CHECK_FOR_INTERRUPTS();(gdb) 54 estate = node->ss.ps.state;(gdb) 55 dir = estate->es_direction;(gdb) 56 forward = ScanDirectionIsForward(dir);(gdb) 57 tuplestorestate = node->tuplestorestate;(gdb) 62 if (tuplestorestate == NULL && node->eflags != 0)(gdb) 64 tuplestorestate = tuplestore_begin_heap(true, false, work_mem);(gdb) 65 tuplestore_set_eflags(tuplestorestate, node->eflags);(gdb) 66 if (node->eflags & EXEC_FLAG_MARK)(gdb) 78 node->tuplestorestate = tuplestorestate;(gdb) 85 eof_tuplestore = (tuplestorestate == NULL) ||(gdb) 86 tuplestore_ateof(tuplestorestate);(gdb) 85 eof_tuplestore = (tuplestorestate == NULL) ||(gdb) 88 if (!forward && eof_tuplestore)(gdb) p eof_tuplestore$1 = false(gdb)

进入tuplestore_gettupleslot

(gdb) n107 slot = node->ss.ps.ps_ResultTupleSlot;(gdb) 108 if (!eof_tuplestore)(gdb) 110 if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))(gdb) steptuplestore_gettupleslot (state=0x3069c18, forward=true, copy=false, slot=0x30687a8) at tuplestore.c:10841084 tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);(gdb)

进入tuplestore_gettuple

(gdb) steptuplestore_gettuple (state=0x3069c18, forward=true, should_free=0x7ffd18474ff7) at tuplestore.c:906906 TSReadPointer *readptr = &state->readptrs[state->activeptr];(gdb)

tuplestore_gettuple->文件读写指针信息

(gdb) n910 Assert(forward || (readptr->eflags & EXEC_FLAG_BACKWARD));(gdb) p *readptr$2 = {eflags = 2, eof_reached = false, current = 0, file = 2139062143, offset = 9187201950435737471}

tuplestore_gettuple->当前状态为TSS_INMEM

(gdb) n912 switch (state->status)(gdb) p *state$3 = {status = TSS_INMEM, eflags = 2, backward = false, interXact = false, truncated = false, availMem = 4177896, allowedMem = 4194304, tuples = 0, myfile = 0x0, context = 0x3067da0, resowner = 0x2fa62c8, copytup = 0xaba7bd <copytup_heap>, writetup = 0xaba811 <writetup_heap>, readtup = 0xaba9d9 <readtup_heap>, memtuples = 0x3051e90, memtupdeleted = 0, memtupcount = 0, memtupsize = 2048, growmemtuples = true, readptrs = 0x3077f70, activeptr = 0, readptrcount = 1, readptrsize = 8, writepos_file = 0, writepos_offset = 0}(gdb) p state->status$4 = TSS_INMEM(gdb)

tuplestore_gettuple->返回NULL

(gdb) n915 *should_free = false;(gdb) n916 if (forward)(gdb) 918 if (readptr->eof_reached)(gdb) 920 if (readptr->current < state->memtupcount)(gdb) p readptr->current$5 = 0(gdb) p state->memtupcount$6 = 0(gdb) n925 readptr->eof_reached = true;(gdb) 926 return NULL;(gdb) 1062 }(gdb)

tuplestore_gettupleslot->返回false

(gdb) ntuplestore_gettupleslot (state=0x3069c18, forward=true, copy=false, slot=0x30687a8) at tuplestore.c:10861086 if (tuple)(gdb) 1098 ExecClearTuple(slot);(gdb) 1099 return false;(gdb)

回到ExecMaterial

(gdb) n1101 }(gdb) ExecMaterial (pstate=0x3068158) at nodeMaterial.c:112112 if (forward)(gdb) 113 eof_tuplestore = true;(gdb)

从outerPlan中获取一行(即从t_big_null中获取一行)

(gdb) n124 if (eof_tuplestore && !node->eof_underlying)(gdb) p node->eof_underlying$7 = false(gdb) n133 outerNode = outerPlanState(node);(gdb) ####define innerPlanState(node) (((PlanState *)(node))->righttree)#define outerPlanState(node) (((PlanState *)(node))->lefttree)###134 outerslot = ExecProcNode(outerNode);(gdb) p outerNode$8 = (PlanState *) 0x3068270(gdb) p *outerNode$9 = {type = T_SeqScanState, plan = 0x3037628, state = 0x3067eb8, ExecProcNode = 0x6f802a <ExecProcNodeFirst>, ExecProcNodeReal = 0x72b904 <ExecSeqScan>, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleDesc = 0x3068578, ps_ResultTupleSlot = 0x0, ps_ExprContext = 0x3068388, ps_ProjInfo = 0x0, scandesc = 0x7fab449cae98, scanops = 0xc3e780 <TTSOpsBufferHeapTuple>, outerops = 0x0, innerops = 0x0, resultops = 0xc3e780 <TTSOpsBufferHeapTuple>, scanopsfixed = true, outeropsfixed = false, inneropsfixed = false, resultopsfixed = true, scanopsset = true, outeropsset = false, inneropsset = false, resultopsset = true}(gdb) p *outerNode->state$10 = {type = T_EState, es_direction = ForwardScanDirection, es_snapshot = 0x2f9cd10, es_crosscheck_snapshot = 0x0, es_range_table = 0x3042130, es_range_table_array = 0x3068108, es_range_table_size = 2, es_relations = 0x3068130, es_rowmarks = 0x0, es_plannedstmt = 0x3042438, es_sourceText = 0x2f74d88 "select * from tbl a where a.id not in (select b.id from t_big_null b);", es_junkFilter = 0x0, es_output_cid = 0, es_result_relations = 0x0, es_num_result_relations = 0, es_result_relation_info = 0x0, es_root_result_relations = 0x0, es_num_root_result_relations = 0, es_partition_directory = 0x0, es_tuple_routing_result_relations = 0x0, es_trig_target_relations = 0x0, es_param_list_info = 0x0, es_param_exec_vals = 0x30680d0, es_queryEnv = 0x0, es_query_cxt = 0x3067da0, es_tupleTable = 0x3068540, es_processed = 0, es_top_eflags = 16, es_instrument = 0, es_finished = false, es_exprcontexts = 0x3068448, es_subplanstates = 0x3068950, es_auxmodifytables = 0x0, es_per_tuple_exprcontext = 0x0, es_epq_active = 0x0, es_use_parallel_mode = false, es_query_dsa = 0x0, es_jit_flags = 25, es_jit = 0x0, es_jit_worker_instr = 0x0}(gdb) p ((PlanState *)node)->righttree$21 = (struct PlanState *) 0x0(gdb)

回过头来看执行计划,Materialize Node的lefttree是Seq Scan on public.t_big_null b,righttree为NULL。

[local]:5432 pg12@testdb=# explain verbose select * from tbl a where a.id not in (select b.id from t_big_null b); QUERY PLAN -------------------------------------------------------------------------------------------- Seq Scan on public.tbl a (cost=0.00..129156.33 rows=1 width=8) Output: a.id, a.value Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..233310.68 rows=9999979 width=4) Output: b.id -> Seq Scan on public.t_big_null b (cost=0.00..144247.79 rows=9999979 width=4) Output: b.id(8 rows)Time: 7.681 ms

获取outerslot

(gdb) n135 if (TupIsNull(outerslot))(gdb) p *outerslot$16 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 0, tts_ops = 0xc3e780 <TTSOpsBufferHeapTuple>, tts_tupleDescriptor = 0x7fab449cae98, tts_values = 0x30684f0, tts_isnull = 0x30684f8, tts_mcxt = 0x3067da0, tts_tid = { ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1}, tts_tableOid = 49155}(gdb) p *outerslot->tts_values$17 = 0(gdb) p outerslot->tts_values[1]$18 = 0(gdb) p outerslot->tts_values[0]$19 = 0(gdb) p *outerslot->tts_tupleDescriptor$20 = {natts = 1, tdtypeid = 49157, tdtypmod = -1, tdrefcount = 2, constr = 0x0, attrs = 0x7fab449caeb0}

获取outerslot后,put到tuplestore中

(gdb) p *node$22 = {ss = {ps = {type = T_MaterialState, plan = 0x3040a60, state = 0x3067eb8, ExecProcNode = 0x720ecf <ExecMaterial>, ExecProcNodeReal = 0x720ecf <ExecMaterial>, instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0, lefttree = 0x3068270, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleDesc = 0x3068690, ps_ResultTupleSlot = 0x30687a8, ps_ExprContext = 0x0, ps_ProjInfo = 0x0, scandesc = 0x3068578, scanops = 0xc3e720 <TTSOpsMinimalTuple>, outerops = 0x0, innerops = 0x0, resultops = 0xc3e720 <TTSOpsMinimalTuple>, scanopsfixed = true, outeropsfixed = false, inneropsfixed = false, resultopsfixed = true, scanopsset = true, outeropsset = false, inneropsset = false, resultopsset = true}, ss_currentRelation = 0x0, ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x3068868}, eflags = 2, eof_underlying = false, tuplestorestate = 0x3069c18}(gdb) n146 if (tuplestorestate)(gdb) 147 tuplestore_puttupleslot(tuplestorestate, outerslot);(gdb) p outerslot->tts_values[0]$23 = 0(gdb) n149 ExecCopySlot(slot, outerslot);(gdb) p outerslot->tts_values[0]$24 = 0(gdb) n150 return slot;(gdb) p outerslot->tts_values[0]$25 = 0(gdb) p slot->tts_values[0]$26 = 0(gdb) n157 }(gdb)

继续“物化”

(gdb) nExecProcNodeFirst (node=0x3068158) at execProcnode.c:446446 }(gdb) cContinuing.Breakpoint 1, ExecMaterial (pstate=0x3068158) at nodeMaterial.c:4141 MaterialState *node = castNode(MaterialState, pstate);(gdb) n49 CHECK_FOR_INTERRUPTS();(gdb) 54 estate = node->ss.ps.state;(gdb) 55 dir = estate->es_direction;(gdb) 56 forward = ScanDirectionIsForward(dir);(gdb) 57 tuplestorestate = node->tuplestorestate;(gdb) 62 if (tuplestorestate == NULL && node->eflags != 0)(gdb) 85 eof_tuplestore = (tuplestorestate == NULL) ||(gdb) 86 tuplestore_ateof(tuplestorestate);(gdb) 85 eof_tuplestore = (tuplestorestate == NULL) ||(gdb) 88 if (!forward && eof_tuplestore)(gdb) 107 slot = node->ss.ps.ps_ResultTupleSlot;(gdb) 108 if (!eof_tuplestore)(gdb) 124 if (eof_tuplestore && !node->eof_underlying)(gdb) 133 outerNode = outerPlanState(node);(gdb) p eof_tuplestore$27 = true(gdb) n134 outerslot = ExecProcNode(outerNode);(gdb) 135 if (TupIsNull(outerslot))(gdb) 146 if (tuplestorestate)(gdb) 147 tuplestore_puttupleslot(tuplestorestate, outerslot);(gdb) 149 ExecCopySlot(slot, outerslot);(gdb) 150 return slot;(gdb) p slot->tts_values[0]$28 = 2(gdb)

第一次执行时间较久,第二次相对快2个数量级,需要继续研究。

[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b); id | value ----+-------(0 rows)Time: 3633462.666 ms (01:00:33.463) --> 包括了debug的时间,实际时间是5s左右[local]:5432 pg12@testdb=# [local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b); id | value ----+-------(0 rows)Time: 6.480 ms --> 第2+次就快很多[local]:5432 pg12@testdb=#

DONE

四、参考资料

N/A