max()VS ORDER BY DESC+LIMIT 1的性能

编程入门 行业动态 更新时间:2024-10-08 23:01:26
本文介绍了max()VS ORDER BY DESC+LIMIT 1的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我今天排除了几个速度较慢的SQL查询的故障,不太了解下面的性能差异:

根据某些条件尝试从数据表中提取max(timestamp)时,如果存在匹配行,使用MAX()比ORDER BY timestamp LIMIT 1慢,但如果找不到匹配行,则使用速度快得多。

SELECT timestamp FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 4 ORDER BY timestamp DESC LIMIT 1; (0 rows) Time: 1314.544 ms SELECT timestamp FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 5 ORDER BY timestamp DESC LIMIT 1; (1 row) Time: 10.890 ms SELECT MAX(timestamp) FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 4; (0 rows) Time: 0.869 ms SELECT MAX(timestamp) FROM data JOIN sensors ON ( sensors.id = data.sensor_id ) WHERE sensor.station_id = 5; (1 row) Time: 84.087 ms

(timestamp)和(sensor_id, timestamp)上有索引,我注意到Postgres对这两种情况使用了非常不同的查询计划和索引:

QUERY PLAN (ORDER BY) -------------------------------------------------------------------------------------------------------- Limit (cost=0.43..9.47 rows=1 width=8) -> Nested Loop (cost=0.43..396254.63 rows=43823 width=8) Join Filter: (data.sensor_id = sensors.id) -> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12) -> Materialize (cost=0.00..6.70 rows=2 width=4) -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4) Filter: (station_id = 4) (7 rows) QUERY PLAN (MAX) ---------------------------------------------------------------------------------------------------------- Aggregate (cost=3680.59..3680.60 rows=1 width=8) -> Nested Loop (cost=0.43..3571.03 rows=43823 width=8) -> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4) Filter: (station_id = 4) -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) Index Cond: (sensor_id = sensors.id) (6 rows)

所以我的两个问题是:

  • 这种性能差异从何而来?我在这里看到了公认的答案MIN/MAX vs ORDER BY and LIMIT,但这似乎不太适用于这里。任何好的资源都将不胜感激。
  • 是否有比添加EXISTS检查更好的方法来提高所有情况下的性能(匹配行与不匹配行)?
  • 编辑以解决下面备注中的问题。我保留了上面的初始查询计划,以备将来参考:

    表定义:

    Table "public.sensors" Column | Type | Modifiers ----------------------+------------------------+----------------------------------------------------------------- id | integer | not null default nextval('sensors_id_seq'::regclass) station_id | integer | not null .... Indexes: "sensor_primary" PRIMARY KEY, btree (id) "ind_station_id" btree (station_id, id) "ind_station" btree (station_id) Table "public.data" Column | Type | Modifiers -----------+--------------------------+------------------------------------------------------------------ id | integer | not null default nextval('data_id_seq'::regclass) timestamp | timestamp with time zone | not null sensor_id | integer | not null avg | integer | Indexes: "timestamp_ind" btree ("timestamp" DESC) "sensor_ind" btree (sensor_id) "sensor_ind_timestamp" btree (sensor_id, "timestamp") "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC) 请注意,我刚才在@Erwin的建议后面添加了ind_station_idon[2-7]>。计时实际上并没有发生重大变化,仍然是ORDER BY DESC + LIMIT 1案例中的>1200ms和MAX案例中的~0.9ms。

    查询计划:

    QUERY PLAN (ORDER BY) ---------------------------------------------------------------------------------------------------------- Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1) Buffers: shared hit=3418066 read=47326 -> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1) Join Filter: (data.sensor_id = sensors.id) Buffers: shared hit=3418066 read=47326 -> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1) Buffers: shared hit=3418065 read=47326 -> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976) Buffers: shared hit=1 -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1) Index Cond: (station_id = 4) Heap Fetches: 0 Buffers: shared hit=1 Planning time: 0.478 ms Execution time: 2161.090 ms (15 rows) QUERY (MAX) ---------------------------------------------------------------------------------------------------------- Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1) Buffers: shared hit=1 -> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1) Buffers: shared hit=1 -> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1) Index Cond: (station_id = 4) Heap Fetches: 0 Buffers: shared hit=1 -> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed) Index Cond: (sensor_id = sensors.id) Heap Fetches: 0 Planning time: 0.435 ms Execution time: 0.048 ms (13 rows)

    与前面的解释一样,ORDER BY执行Scan using timestamp_in on data,而在MAX情况下没有执行。

    Postgres版本: 来自Ubuntu Repos的帖子:PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

    请注意,有NOT NULL个约束,因此ORDER BY不必对空行进行排序。

    还请注意,我很感兴趣的是差异来自哪里。虽然不理想,但我可以使用EXISTS (<1ms),然后使用SELECT (~11ms)相对较快地检索数据。

    推荐答案

    在sensor.station_id上似乎没有索引,这在这里很可能很重要。

    max()和ORDER BY DESC + LIMIT 1之间存在实际差异。很多人似乎都错过了这一点。空值按降序排列第一个。因此ORDER BY timestamp DESC LIMIT 1返回带有timestamp IS NULL的行(如果存在),而聚合函数max()忽略空值并返回最新的非空时间戳。

    对于您的情况,由于您的列d.timestamp定义为NOT NULL(如您的更新所示),因此没有有效的区别。具有DESC NULLS LAST和ORDER BY查询的ORDER BY中相同子句的索引应该仍然最适合您。我建议使用这些索引(我下面的查询建立在第二个索引的基础上):

    sensor(station_id, id) data(sensor_id, timestamp DESC NULLS LAST)

    您可以删除其他索引变量sensor_ind_timestamp和sensor_ind_timestamp_desc,除非您有仍然需要它们的其他查询(不太可能,但可能)。

    更重要的,还有另一个困难:第一个表sensors上的过滤返回的行很少,但仍然(可能)返回多行。Postgres希望在添加的EXPLAIN输出中找到2行(rows=2)。 完美的技术是对第二个表data松散索引扫描-这在Postgres 9.4(或Postgres 9.5)中当前没有实现。您可以通过多种方式重写查询以绕过此限制。详细信息:

    • Optimize GROUP BY query to retrieve latest record per user

    最好是:

    SELECT d.timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT timestamp FROM data WHERE sensor_id = s.id ORDER BY timestamp DESC NULLS LAST LIMIT 1 ) d WHERE s.station_id = 4 ORDER BY d.timestamp DESC NULLS LAST LIMIT 1;

    由于外部查询的样式大多无关紧要,您也可以只需:

    SELECT max(d.timestamp) AS timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT timestamp FROM data WHERE sensor_id = s.id ORDER BY timestamp DESC NULLS LAST LIMIT 1 ) d WHERE s.station_id = 4;

    和max()变体现在的执行速度应该差不多一样快:

    SELECT max(d.timestamp) AS timestamp FROM sensors s CROSS JOIN LATERAL ( SELECT max(timestamp) AS timestamp FROM data WHERE sensor_id = s.id ) d WHERE s.station_id = 4;

    甚至最短的:

    SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp FROM sensors s WHERE station_id = 4;

    注意双括号!

    LIMIT在LATERAL联接中的另一个优点是,您可以检索所选行的任意列,而不仅仅是最新的时间戳(一列)。

    相关:

    • Why do NULL values come first when ordering DESC in a PostgreSQL query?
    • What is the difference between LATERAL and a subquery in PostgreSQL?
    • Select first row in each GROUP BY group?
    • Optimize groupwise maximum query

    更多推荐

    max()VS ORDER BY DESC+LIMIT 1的性能

    本文发布于:2023-10-23 02:21:37,感谢您对本站的认可!
    本文链接:https://www.elefans.com/category/jswz/34/1519460.html
    版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
    本文标签:性能   ORDER   max   LIMIT   DESC

    发布评论

    评论列表 (有 0 条评论)
    草根站长

    >www.elefans.com

    编程频道|电子爱好者 - 技术资讯及电子产品介绍!