这篇文章主要介绍“PostgreSQL的相似搜索插件有哪些”,在日常操作中,相信很多人在PostgreSQL的相似搜索插件有哪些问题上存在疑惑,小编查阅了各式资料,整理出简单好用的操作方法,希望对大家解答”PostgreSQL的相似搜索插件有哪些”的疑惑有所帮助!接下来,请跟着小编一起来学习吧!

类别1 : 元素重叠度相似

类似倒排,以元素重叠度为基准的相似计算。广泛应用于数组、全文检索、字符串、文本特征值、多列任意组合查询的相似搜索。

代表的PostgreSQL插件如下

1、rum

https://github.com/postgrespro/rum

2、pg_trgm

https://www.postgresql.org/docs/devel/static/pgtrgm.html

3、smlar

http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

4、smlar+海明码(向量相似)

《海量数据,海明(simhash)距离高效检索(smlar) - 阿里云RDS PosgreSQL最佳实践》

5、pg_similarity

https://github.com/eulerto/pg_similarity

类别2 : 向量相似(类似knn距离)

向量相似与元素重叠度计算,显然是不同的,基于元素的重叠度相似,可以利用倒排来实现,如上节描述。而基于元素向量相似,需要用到自定义的索引接口,典型的代表是GiST索引在空间距离上的计算,以及imgsmlr插件在图像特征值相似方面的计算。

1、imgsmlr(图片向量相似)

https://github.com/postgrespro/imgsmlr

原理如下

64*64的图像,取16个区域的平均值,生成16个浮点数,作为图像特征值。

一个值求相似,相减绝对值最小。

2个值求相似,可以理解为平面坐标,求距离最小(GiST knn距离排序)。

3个值求相似,可以理解为3D坐标里面的点,求距离最小的点。

...

16个值求相似,与上类似。imgsmlr插件使用gist索引接口实现了16个元素的向量相似索引排序。

例子

postgres=#\dt_imgTable"public.t_img"Column|Type|Collation|Nullable|Default--------+-----------+-----------+----------+---------id|integer||notnull|sig|signature|||Indexes:"t_img_pkey"PRIMARYKEY,btree(id)"idx_t_img_1"gist(sig)

数据量

postgres=#selectcount(*)fromt_img;count-----------319964709(1row)Time:698.075ms

图像特征值搜索例子,速度杠杠的。(以上使用citus+postgres+128 shard)

postgres=#select*fromt_imgorderbysig<->'(3.539080,0.243861,1.509150,1.781380,8.677560,4.232060,8.979810,1.665030,1.294100,4.449800,9.200450,1.859860,5.440250,7.788580,0.514258,8.424920)'limit1;id|sig-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------148738668|(2.554440,0.310499,2.322520,0.478624,7.816080,4.360440,8.287050,1.011060,2.114320,3.541110,9.166300,1.922250,4.488640,7.897890,1.600290,7.462080)(1row)Time:337.301ms2 CUBE

https://www.postgresql.org/docs/devel/static/cube.html

a<->bfloat8Euclideandistancebetweenaandb.a<#>bfloat8Taxicab(L-1metric)distancebetweenaandb.a<=>bfloat8Chebyshev(L-infmetric)distancebetweenaandb.

计算图片向量相似时,cube比imgsmlr性能稍差,因为cube使用的是float8,而imgsmlr使用的是float4。

例子

cube

postgres=#explain(analyze,verbose,timing,costs,buffers)select*fromt_img0orderbysig::Text::cube<->'(0.435404,6.602870,9.050220,9.379750,2.483920,1.534660,0.363753,4.079670,0.124681,3.611220,7.127460,7.880070,2.574830,6.778820,5.156320,8.329430)'limit1;QUERYPLAN----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Limit(cost=0.36..0.37rows=1width=76)(actualtime=147.432..147.434rows=1loops=1)Output:id,sig,((((sig)::text)::cube<->'(0.435404,6.60287,9.05022,9.37975,2.48392,1.53466,0.363753,4.07967,0.124681,3.61122,7.12746,7.88007,2.57483,6.77882,5.15632,8.32943)'::cube))Buffers:sharedhit=16032->IndexScanusingidx_t_img0_1onpublic.t_img0(cost=0.36..13824.28rows=754085width=76)(actualtime=147.430..147.430rows=1loops=1)Output:id,sig,(((sig)::text)::cube<->'(0.435404,6.60287,9.05022,9.37975,2.48392,1.53466,0.363753,4.07967,0.124681,3.61122,7.12746,7.88007,2.57483,6.77882,5.15632,8.32943)'::cube)OrderBy:(((t_img0.sig)::text)::cube<->'(0.435404,6.60287,9.05022,9.37975,2.48392,1.53466,0.363753,4.07967,0.124681,3.61122,7.12746,7.88007,2.57483,6.77882,5.15632,8.32943)'::cube)Buffers:sharedhit=16032PlanningTime:0.096msExecutionTime:148.905ms(9rows)

imgsmlr

postgres=#explain(analyze,verbose,timing,costs,buffers)select*fromt_img0orderbysig<->'(0.435404,6.602870,9.050220,9.379750,2.483920,1.534660,0.363753,4.079670,0.124681,3.611220,7.127460,7.880070,2.574830,6.778820,5.156320,8.329430)'limit2;QUERYPLAN------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Limit(cost=0.36..0.37rows=2width=72)(actualtime=40.284..48.183rows=2loops=1)Output:id,sig,((sig<->'(0.435404,6.602870,9.050220,9.379750,2.483920,1.534660,0.363753,4.079670,0.124681,3.611220,7.127460,7.880070,2.574830,6.778820,5.156320,8.329430)'::signature))Buffers:sharedhit=2914->IndexScanusingt_img0_sig_idxonpublic.t_img0(cost=0.36..7032.36rows=754085width=72)(actualtime=40.282..48.179rows=2loops=1)Output:id,sig,(sig<->'(0.435404,6.602870,9.050220,9.379750,2.483920,1.534660,0.363753,4.079670,0.124681,3.611220,7.127460,7.880070,2.574830,6.778820,5.156320,8.329430)'::signature)OrderBy:(t_img0.sig<->'(0.435404,6.602870,9.050220,9.379750,2.483920,1.534660,0.363753,4.079670,0.124681,3.611220,7.127460,7.880070,2.574830,6.778820,5.156320,8.329430)'::signature)Buffers:sharedhit=2914PlanningTime:0.091msExecutionTime:48.210ms(9rows)

cube相比imgsmlr的好处是:cube可以计算任意维度的向量相似,imgsmlr则仅用于计算16维(signation类型)的向量相似

到此,关于“PostgreSQL的相似搜索插件有哪些”的学习就结束了,希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习,快去试试吧!若想继续学习更多相关知识,请继续关注亿速云网站,小编会继续努力为大家带来更多实用的文章!