PostgreSQL 查询通过索引扫描运行得更快，但引擎选择散列连接

编程入门行业动态更新时间:2024-10-28 12:27:06

本文介绍了PostgreSQL 查询通过索引扫描运行得更快，但引擎选择散列连接的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

查询:

SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 50027

如果我设置 SET enable_seqscan = off，那么它会做的很快，就是:

If I set SET enable_seqscan = off, then it does the fast thing, which is:

QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..27349.80 rows=3395 width=72) (actual time=28.726..65.056 rows=3398 loops=1) -> Index Scan using replays_playeringame_player_id on replays_playeringame (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.019..2.412 rows=3398 loops=1) Index Cond: (player_id = 50027) -> Index Scan using replays_game_pkey on replays_game (cost=0.00..5.41 rows=1 width=72) (actual time=0.017..0.017 rows=1 loops=3398) Index Cond: (id = replays_playeringame.game_id) Total runtime: 65.437 ms

但是没有可怕的 enable_seqscan，它选择做一个更慢的事情:

But without the dreaded enable_seqscan, it chooses to do a slower thing:

QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=7330.18..18145.24 rows=3395 width=72) (actual time=92.380..535.422 rows=3398 loops=1) Hash Cond: (replays_playeringame.game_id = replays_game.id) -> Index Scan using replays_playeringame_player_id on replays_playeringame (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.020..2.899 rows=3398 loops=1) Index Cond: (player_id = 50027) -> Hash (cost=3668.08..3668.08 rows=151208 width=72) (actual time=90.842..90.842 rows=151208 loops=1) Buckets: 1024 Batches: 32 (originally 16) Memory Usage: 1025kB -> Seq Scan on replays_game (cost=0.00..3668.08 rows=151208 width=72) (actual time=0.020..29.061 rows=151208 loops=1) Total runtime: 535.821 ms

以下是相关索引:

所以我的问题是，我做错了什么，Postgres 错误地估计了两种加入方式的相对成本?我在成本估算中看到它认为散列连接会更快.并且它对索引连接成本的估计相差了 500 倍.

So my question is, what am I doing wrong that Postgres is mis-estimating the relative costs of the two ways of joining? I see in the cost estimates that it thinks the hash-join will be faster. And its estimate of the cost of the index-join is off by a factor of 500.

我怎样才能给 Postgres 提供更多线索?在运行上述所有内容之前，我确实运行了 VACUUM ANALYZE.

How can I give Postgres more of a clue? I did run a VACUUM ANALYZE immediately before running all of the above.

有趣的是，如果我为游戏数量较少的玩家运行此查询，Postgres 会选择执行索引扫描 + 嵌套循环.因此，关于大量游戏的一些东西会刺激这种不受欢迎的行为，即相对估计成本与实际估计成本不一致.

Interestingly, if I run this query for a player with a smaller # of games, Postgres chooses to do the index-scan + nested-loop. So something about the large # of games tickles this undesired behavior where relative estimated cost is out of line with actual estimated cost.

最后，我应该使用 Postgres 吗?我不希望成为数据库调优方面的专家，因此我正在寻找一种数据库，该数据库在认真的开发人员的关注下能够表现得相当好，而不是专门的 DBA.我担心如果我坚持使用 Postgres，我会源源不断地遇到这样的问题，迫使我成为 Postgres 专家，也许另一个 DB 会更宽容地采用更随意的方法.

Finally, should I be using Postgres at all? I don't wish to become an expert in database tuning, so I'm looking for a database that will perform reasonably well with a conscientious developer's level of attention, as opposed to a dedicated DBA. I am afraid that if I stick with Postgres I will have a steady stream of issues like this that will force me to become a Postgres expert, and perhaps another DB will be more forgiving of a more casual approach.

Postgres 专家 (RhodiumToad) 审查了我的完整数据库设置(pastebin/77QuiQSp)并推荐set cpu_tuple_cost = 0.1.这带来了显着的加速:pastebin/nTHvSHVd

A Postgres expert (RhodiumToad) reviewed my full database settings (pastebin/77QuiQSp) and recommended set cpu_tuple_cost = 0.1. That gave a dramatic speedup: pastebin/nTHvSHVd

或者，切换到 MySQL 也很好地解决了这个问题.我的 OS X 机器上默认安装了 MySQL 和 Postgres，MySQL 的速度提高了 2 倍，比较了通过重复执行查询而预热"的查询.在冷"查询中，即第一次执行给定查询时，MySQL 快 5 到 150 倍.冷查询的性能对于我的特定应用程序非常重要.

Alternatively, switching to MySQL also solved the problem pretty nicely. I have a default installation of MySQL and Postgres on my OS X box, and MySQL is 2x faster, comparing queries that are "warmed up" by repeatedly executing the query. On "cold" queries, i.e. the first time a given query is executed, MySQL is 5 to 150 times faster. The performance of cold queries is pretty important for my particular application.

就我而言，最大的问题仍然悬而未决——Postgres 是否需要比 MySQL 更多的调整和配置才能运行良好?例如，请考虑这里的评论者提供的任何建议都不起作用.

The big question, as far as I'm concerned, is still outstanding -- will Postgres require more fiddling and configuration to run well than MySQL? For example, consider that none of the suggestions offered by the commenters here worked.

推荐答案

我的猜测是您正在使用默认的 random_page_cost = 4，这太高了，使得索引扫描成本太高.

My guess is that you are using the default random_page_cost = 4, which is way too high, making index scan too costly.

我尝试使用此脚本重建 2 个表:

I try to reconstruct the 2 tables with this script:

CREATE TABLE replays_game ( id integer NOT NULL, PRIMARY KEY (id) ); CREATE TABLE replays_playeringame ( player_id integer NOT NULL, game_id integer NOT NULL, PRIMARY KEY (player_id, game_id), CONSTRAINT replays_playeringame_game_fkey FOREIGN KEY (game_id) REFERENCES replays_game (id) ); CREATE INDEX ix_replays_playeringame_game_id ON replays_playeringame (game_id); -- 150k games INSERT INTO replays_game SELECT generate_series(1, 150000); -- ~150k players, ~2 games each INSERT INTO replays_playeringame select trunc(random() * 149999 + 1), generate_series(1, 150000); INSERT INTO replays_playeringame SELECT * FROM ( SELECT trunc(random() * 149999 + 1) as player_id, generate_series(1, 150000) as game_id ) AS t WHERE NOT EXISTS ( SELECT 1 FROM replays_playeringame WHERE t.player_id = replays_playeringame.player_id AND t.game_id = replays_playeringame.game_id ) ; -- the heavy player with 3000 games INSERT INTO replays_playeringame select 999999, generate_series(1, 3000);

默认值为 4:

game=# set random_page_cost = 4; SET game=# explain analyse SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 999999; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1) Hash Cond: (replays_game.id = replays_playeringame.game_id) -> Seq Scan on replays_game (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1) -> Hash (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 106kB -> Bitmap Heap Scan on replays_playeringame (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1) Recheck Cond: (player_id = 999999) -> Bitmap Index Scan on replays_playeringame_pkey (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1) Index Cond: (player_id = 999999) Total runtime: 110.621 ms

将其降低到 2 后:

game=# set random_page_cost = 2; SET game=# explain analyse SELECT "replays_game".* FROM "replays_game" INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id" WHERE "replays_playeringame"."player_id" = 999999; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1) -> Bitmap Heap Scan on replays_playeringame (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1) Recheck Cond: (player_id = 999999) -> Bitmap Index Scan on replays_playeringame_pkey (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1) Index Cond: (player_id = 999999) -> Index Scan using replays_game_pkey on replays_game (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000) Index Cond: (id = replays_playeringame.game_id) Total runtime: 28.542 ms (8 rows)

如果使用 SSD，我会进一步降低到 1.1.

If using SSD, I would lower it further to 1.1.

至于你的最后一个问题，我真的认为你应该坚持使用 postgresql.我有使用 postgresql 和 mssql 的经验，我需要在后者上付出三倍的努力，才能使其性能达到前者的一半.

As for your last question, I really think you should stick with postgresql. I have experience with postgresql and mssql, and I need to put in triple the effort into the later for it to perform half as well as the former.

更多推荐

PostgreSQL 查询通过索引扫描运行得更快,但引擎选择散列连接

本文发布于:2023-10-23 06:26:53，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1520057.html