Improving PostgreSQL aggregate query performance

Problem description


    I am aggregating data from a Postgres table; the query takes approximately 2 seconds, and I want to reduce that to under a second.

    Please find below the execution details:

    Query

    select a.search_keyword,
           hll_cardinality(hll_union_agg(a.users))::int    as user_count,
           hll_cardinality(hll_union_agg(a.sessions))::int as session_count,
           sum(a.total)                                    as keyword_count
    from rollup_day a
    where a.created_date between '2018-09-01' and '2019-09-30'
      and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
    group by a.search_keyword
    order by session_count desc
    limit 100;

    Table metadata

  • Total number of rows - 506527
  • Composite index on columns: tenant_id and created_date
  • Query plan

    Custom Scan  (cost=0.00..0.00 rows=0 width=0) (actual time=1722.685..1722.694 rows=100 loops=1)
      Task Count: 1
      Tasks Shown: All
      ->  Task
            Node: host=localhost port=5454 dbname=postgres
            ->  Limit  (cost=64250.24..64250.49 rows=100 width=42) (actual time=1783.087..1783.106 rows=100 loops=1)
                  ->  Sort  (cost=64250.24..64558.81 rows=123430 width=42) (actual time=1783.085..1783.093 rows=100 loops=1)
                        Sort Key: ((hll_cardinality(hll_union_agg(sessions)))::integer) DESC
                        Sort Method: top-N heapsort  Memory: 33kB
                        ->  GroupAggregate  (cost=52933.89..59532.83 rows=123430 width=42) (actual time=905.502..1724.363 rows=212633 loops=1)
                              Group Key: search_keyword
                              ->  Sort  (cost=52933.89..53636.53 rows=281055 width=54) (actual time=905.483..1351.212 rows=280981 loops=1)
                                    Sort Key: search_keyword
                                    Sort Method: external merge  Disk: 18496kB
                                    ->  Seq Scan on rollup_day a  (cost=0.00..17890.22 rows=281055 width=54) (actual time=29.720..112.161 rows=280981 loops=1)
                                          Filter: ((created_date >= '2018-09-01'::date) AND (created_date <= '2019-09-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
                                          Rows Removed by Filter: 225546
            Planning Time: 0.129 ms
            Execution Time: 1786.222 ms
    Planning Time: 0.103 ms
    Execution Time: 1722.718 ms

    What I've tried

  • I've tried indexes on tenant_id and created_date, but because the data volume is large, the planner always chooses a sequential scan rather than an index scan for these filters. I've read that the Postgres query engine switches to a sequential scan when the data returned is more than 5-10% of the total rows. Please follow the link for more reference. (A diagnostic sketch for double-checking this is included right after this list.)
  • I've increased the work_mem to 100MB but it only improved the performance a little bit.
  • Any help would be really appreciated.
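
    As an aside that is not part of the original post, one standard way to double-check whether the planner's sequential scan really is the cheaper choice is to discourage seq scans for the current session only and compare the resulting plan and timings. The statement below uses a simplified form of the query (the hll columns are left out) purely to illustrate the technique; in a distributed (Citus-style) setup like the one visible in the plans, the setting may also need to take effect on the node that plans the shard query.

    -- Diagnostic only; never leave enable_seqscan disabled in production.
    SET enable_seqscan = off;

    EXPLAIN (ANALYZE, BUFFERS)
    select a.search_keyword, sum(a.total) as keyword_count
    from rollup_day a
    where a.created_date between '2018-09-01' and '2019-09-30'
      and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
    group by a.search_keyword
    order by keyword_count desc
    limit 100;

    RESET enable_seqscan;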

    Update

    Query plan after setting work_mem to 100MB

    Custom Scan  (cost=0.00..0.00 rows=0 width=0) (actual time=1375.926..1375.935 rows=100 loops=1)
      Task Count: 1
      Tasks Shown: All
      ->  Task
            Node: host=localhost port=5454 dbname=postgres
            ->  Limit  (cost=48348.85..48349.10 rows=100 width=42) (actual time=1307.072..1307.093 rows=100 loops=1)
                  ->  Sort  (cost=48348.85..48633.55 rows=113880 width=42) (actual time=1307.071..1307.080 rows=100 loops=1)
                        Sort Key: (sum(total)) DESC
                        Sort Method: top-N heapsort  Memory: 35kB
                        ->  GroupAggregate  (cost=38285.79..43996.44 rows=113880 width=42) (actual time=941.504..1261.177 rows=172945 loops=1)
                              Group Key: search_keyword
                              ->  Sort  (cost=38285.79..38858.52 rows=229092 width=54) (actual time=941.484..963.061 rows=227261 loops=1)
                                    Sort Key: search_keyword
                                    Sort Method: quicksort  Memory: 32982kB
                                    ->  Seq Scan on rollup_day_104290 a  (cost=0.00..17890.22 rows=229092 width=54) (actual time=38.803..104.350 rows=227261 loops=1)
                                          Filter: ((created_date >= '2019-01-01'::date) AND (created_date <= '2019-12-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
                                          Rows Removed by Filter: 279266
            Planning Time: 0.131 ms
            Execution Time: 1308.814 ms
    Planning Time: 0.112 ms
    Execution Time: 1375.961 ms

    Update 2

    After creating an index on created_date and increasing work_mem to 120MB

    create index date_idx on rollup_day(created_date);

    The total number of rows is 12,124,608.

    Query Plan is:

    Custom Scan  (cost=0.00..0.00 rows=0 width=0) (actual time=2635.530..2635.540 rows=100 loops=1)
      Task Count: 1
      Tasks Shown: All
      ->  Task
            Node: host=localhost port=9702 dbname=postgres
            ->  Limit  (cost=73545.19..73545.44 rows=100 width=51) (actual time=2755.849..2755.873 rows=100 loops=1)
                  ->  Sort  (cost=73545.19..73911.25 rows=146424 width=51) (actual time=2755.847..2755.858 rows=100 loops=1)
                        Sort Key: (sum(total)) DESC
                        Sort Method: top-N heapsort  Memory: 35kB
                        ->  GroupAggregate  (cost=59173.97..67948.97 rows=146424 width=51) (actual time=2014.260..2670.732 rows=296537 loops=1)
                              Group Key: search_keyword
                              ->  Sort  (cost=59173.97..60196.85 rows=409152 width=55) (actual time=2013.885..2064.775 rows=410618 loops=1)
                                    Sort Key: search_keyword
                                    Sort Method: quicksort  Memory: 61381kB
                                    ->  Index Scan using date_idx_102913 on rollup_day_102913 a  (cost=0.42..21036.35 rows=409152 width=55) (actual time=0.026..183.370 rows=410618 loops=1)
                                          Index Cond: ((created_date >= '2018-01-01'::date) AND (created_date <= '2018-12-31'::date))
                                          Filter: (tenant_id = '12850a62-19ac-477d-9cd7-837f3d716885'::uuid)
            Planning Time: 0.135 ms
            Execution Time: 2760.667 ms
    Planning Time: 0.090 ms
    Execution Time: 2635.568 ms

    Solution

    You should experiment with higher settings of work_mem until you get an in-memory sort. Of course you can only be generous with memory if your machine has enough of it.
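
    For the experiment itself, a per-session setting is usually enough; the value below is only an illustration, not a recommendation from the original answer:

    -- Example value only; size this to what the machine can actually spare.
    SHOW work_mem;            -- current setting
    SET work_mem = '256MB';   -- applies to this session only
    -- re-run the aggregate query and check its plan: the goal is for
    -- "Sort Method: external merge  Disk: ..." to become an in-memory
    -- method such as "quicksort  Memory: ..."
    RESET work_mem;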

    What would make your query way faster is if you store pre-aggregated data, either using a materialized view or a second table and a trigger on your original table that keeps the sums in the other table updated. I don't know if that is possible with your data, as I don't know what hll_cardinality and hll_union_agg are.
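
    As a rough illustration of the pre-aggregation idea, a materialized view over the daily rollup might look like the sketch below. Everything in it is hypothetical rather than taken from the original post: the monthly grain, the view and index names, and the assumption that users and sessions are hll columns whose unions can be stored and unioned again later.

    -- Hypothetical sketch: roll the daily table up to monthly grain so the
    -- expensive query has far fewer rows to union and sum at query time.
    create materialized view rollup_month as
    select tenant_id,
           date_trunc('month', created_date)::date as created_month,
           search_keyword,
           hll_union_agg(users)    as users,
           hll_union_agg(sessions) as sessions,
           sum(total)              as total
    from rollup_day
    group by 1, 2, 3;

    create index rollup_month_tenant_month_idx on rollup_month (tenant_id, created_month);

    -- refresh after each load into rollup_day, or on a schedule
    refresh materialized view rollup_month;

    Queries whose date range lines up with whole months could then read from rollup_month instead of rollup_day; the trigger-maintained summary table mentioned above would achieve the same effect while staying current without explicit refreshes.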
