This article looks at how count(distinct) is computed in a distributed database, and how to speed it up.

背景

In a distributed database, computing count(distinct xxx) works on the distinct column roughly as follows (a logically equivalent SQL rewrite is sketched after the list):

1. De-duplicate the column locally on each node.

2. Redistribute the de-duplicated data by that column (if there are very many distinct values, this step is expensive).

3. De-duplicate again after redistribution.

4. Run count(xxx) on each node.

5. Sum the counts from all nodes.
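The steps above correspond to a two-phase rewrite that the planner produces automatically; the sketch below is purely illustrative, using the column and table names from the Greenplum example that follows.

-- Illustrative only: a logically equivalent rewrite of
--   select count(distinct c_acctbal) from customer;
select count(c_acctbal)          -- step 4/5: count the surviving rows, sum across nodes
from (
    select c_acctbal             -- steps 1-3: de-duplicate per node,
    from customer                -- redistribute by c_acctbal,
    group by c_acctbal           -- then de-duplicate again
) t;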

For example, here is a Greenplum execution plan:

postgres=# explain analyze select count(distinct c_acctbal) from customer;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=182242.41..182242.42 rows=1 width=8)
   Rows out:  1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms.
   ->  Gather Motion 16:1  (slice2; segments: 16)  (cost=53392.85..173982.82 rows=660767 width=8)
         Rows out:  818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms.
         ->  HashAggregate  (cost=53392.85..61652.43 rows=41298 width=8)
               Group By: customer.c_acctbal
               Rows out:  Avg 51177.1 rows x 16 workers.  Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms.
               ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=30266.00..43481.34 rows=41298 width=8)
                     Hash Key: customer.c_acctbal
                     Rows out:  Avg 89865.6 rows x 16 workers at destination.  Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms.
                     ->  HashAggregate  (cost=30266.00..30266.00 rows=41298 width=8)
                           Group By: customer.c_acctbal
                           Rows out:  Avg 89865.6 rows x 16 workers.  Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms.
                           ->  Append-only Columnar Scan on customer  (cost=0.00..22766.00 rows=93750 width=8)
                                 Rows out:  Avg 93750.0 rows x 16 workers.  Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms.
 Slice statistics:
   (slice0)    Executor memory: 387K bytes.
   (slice1)    Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0).
   (slice2)    Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0).
 Statement statistics:
   Memory used: 1280000K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 723.143 ms
(23 rows)

Here is the equivalent Citus example:

postgres=# explain analyze select count(distinct bid) from pgbench_accounts;
                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1)
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=1921 dbname=postgres
               ->  HashAggregate  (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1)
                     Group Key: bid
                     ->  Seq Scan on pgbench_accounts_106812 pgbench_accounts  (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1)
             Planning time: 0.445 ms
             Execution time: 3.781 ms
 Planning time: 1.399 ms
 Execution time: 32.159 ms
(13 rows)

For scenarios where an estimate is good enough, i.e. an exact distinct count is not required, PostgreSQL has an extension called hll that estimates the number of distinct elements.
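As a standalone illustration of what hll does, the sketch below estimates the number of distinct values of a column in plain PostgreSQL using the extension's aggregate functions; the table t and its integer column x are hypothetical.

-- Assumes a hypothetical table t(x int) and that "create extension hll;" has been run.
-- Build an hll sketch from the hashed values, then read its estimated cardinality.
select round(hll_cardinality(hll_add_agg(hll_hash_integer(x)))) as approx_distinct_x
from t;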

Combined with hll, Citus can run count(distinct xxx) extremely fast, and it stays fast even when there are very many distinct values.

SET citus.count_distinct_error_rate to 0.005;  -- 0.005 is the allowed error rate

Accelerating Citus count(distinct xxx) with hll: a worked example

Deployment

1. On every node (coordinator and workers), install the hll software:

yum install -y gcc-c++

cd ~/
git clone https://github.com/citusdata/postgresql-hll

cd postgresql-hll

. /var/lib/pgsql/.bash_profile

USE_PGXS=1 make
USE_PGXS=1 make install
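After the build, you can optionally confirm that the server sees the extension's control file; this check is not part of the original steps, it only queries a standard catalog view.

-- Run on each node; expect one row for hll with its default version.
select name, default_version
from pg_available_extensions
where name = 'hll';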

2. On every node (coordinator and workers), create the extension in each database that will use HLL:

su-postgres-c"psql-dpostgres-c'createextensionhll;'"su-postgres-c"psql-dnewdb-c'createextensionhll;'"使用举例1、创建测试表,128 shard

Usage example

1. Create a test table with 128 shards:

create table test(id int primary key, a int, b int, c int);

set citus.shard_count = 128;

select create_distributed_table('test', 'id');
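As a quick sanity check (not part of the original steps), the Citus metadata table pg_dist_shard should now report 128 shards for the new table.

-- Expected result: 128
select count(*)
from pg_dist_shard
where logicalrelid = 'test'::regclass;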

2. Insert 1 billion rows of test data: column a has 10 distinct values, b has 100, and c has 1,000,000.

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1, 1000000000) t(id);

3. (On the coordinator) set the parameter globally or at the session level to specify the error rate; the smaller the value, the lower the error:

SET citus.count_distinct_error_rate to 0.005;

newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;
                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=0.00..0.00 rows=0 width=0)
   Group Key: remote_scan.worker_column_2
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=8001 dbname=newdb
               ->  GroupAggregate  (cost=97272.79..105102.29 rows=1000 width=36)
                     Group Key: bid
                     ->  Sort  (cost=97272.79..99227.04 rows=781700 width=4)
                           Sort Key: bid
                           ->  Seq Scan on pgbench_accounts_102008 pgbench_accounts  (cost=0.00..20759.00 rows=781700 width=4)
(12 rows)

4. Compare with and without HLL (few distinct values: HLL brings no speedup, because there is no bottleneck to begin with)

4.1 Without hll

newdb=# set citus.count_distinct_error_rate to 0;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 423.364 ms

postgres=# set citus.count_distinct_error_rate to 0;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2392.709 ms (00:02.393)

4.2 With hll

newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 444.287 ms

postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2375.473 ms (00:02.375)

5. Compare with and without HLL (many distinct values: HLL improves performance dramatically)

5.1 Without hll

postgres=# set citus.count_distinct_error_rate to 0;
postgres=# select count(distinct (a,c)) from test;
  count
----------
 10000000
(1 row)
Time: 5826241.205 ms (01:37:06.241)

With 128 shards, each node may have to send up to 1,000,000,000 / 128 ≈ 7.8 million rows to the coordinator, so the slowness is understandable. On the other hand, the coordinator could de-duplicate while it receives the data (PostgreSQL 11 added capabilities such as parallel gather and merge sort that the Citus coordinator could borrow from) instead of waiting for all of it to arrive before de-duplicating.

5.2 With hll

postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct (a,c)) from test;
  count
---------
 9999995
(1 row)
Time: 4468.749 ms (00:04.469)

6. Performance comparison across different precision settings

newdb=# set citus.count_distinct_error_rate to 0.1;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
  count
----------
 94778491
(1 row)
Time: 545.301 ms

newdb=# set citus.count_distinct_error_rate to 0.01;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100293937
(1 row)
Time: 554.333 ms

-- 0.005 is the recommended setting
newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100136086
(1 row)
Time: 1053.070 ms (00:01.053)

newdb=# set citus.count_distinct_error_rate to 0.001;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts;
   count
-----------
 100422107
(1 row)
Time: 9287.934 ms (00:09.288)
