Calculate the number of concurrent events in SQL

Problem description


    I have a table that holds phone calls, with the following fields:

    • ID
    • STARTTIME
    • ENDTIME
    • STATUS
    • CALL_FROM
    • CALL_TO

There are 2.9 million records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.

    Searching on stackoverflow, I found some useful SQL and modified it to what I think logically should work. The problem is that the query runs for many hours and never returns:

SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE T1.StartTime between T2.StartTime and T2.EndTime
AND T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY T1.sid
ORDER BY CountSimultaneous DESC;

    Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?

    EDIT:

    Explain plan:

Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)

    Table creation script:

CREATE TABLE calls_nov (
  sid       varchar,
  starttime timestamp,
  endtime   timestamp,
  call_to   varchar,
  call_from varchar,
  status    varchar
);

    Index creation:

CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);
CREATE INDEX starttime_index on calls_nov (starttime);
CREATE INDEX endtime_index on calls_nov (endtime);

Solution

1.) Your query did not catch all overlaps - this has already been fixed by the other answers.

    2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:

    BETWEEN '2011-11-02' AND '2011-11-03'

This would include '2011-11-03 00:00'. The upper bound has to be excluded.
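The difference between the inclusive BETWEEN and the corrected half-open range can be seen with plain timestamps. A minimal Python sketch (purely illustrative; the dates are taken from the question):

```python
from datetime import datetime

# A call starting exactly at midnight of Nov 3 slips into an
# inclusive BETWEEN '2011-11-02' AND '2011-11-03' range.
start = datetime(2011, 11, 3, 0, 0)
lo, hi = datetime(2011, 11, 2), datetime(2011, 11, 3)

between = lo <= start <= hi    # mimics SQL BETWEEN (inclusive on both ends)
half_open = lo <= start < hi   # the corrected predicate: upper bound excluded

print(between, half_open)  # True False
```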

3.) Removed the mixed-case syntax without double quotes. Unquoted identifiers are cast to lower case automatically. To put it simply: best not to use mixed-case identifiers at all in PostgreSQL.

4.) Transformed the query to use an explicit JOIN, which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.

    5.) Simplified the syntax a bit to arrive at this base query:

SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime   >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;

    This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.
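The join condition above is the standard interval-overlap test: two calls overlap iff each one starts no later than the other ends. A small Python sketch with three hypothetical calls shows how the per-call counts (self included, as in the query) come out:

```python
from datetime import datetime

def overlaps(s1, e1, s2, e2):
    # Two intervals overlap iff each starts before (or when) the other ends.
    return s1 <= e2 and e1 >= s2

# Made-up sample calls: 'a' and 'b' overlap, 'c' stands alone.
calls = [
    ("a", datetime(2011, 11, 2, 9, 0),  datetime(2011, 11, 2, 9, 30)),
    ("b", datetime(2011, 11, 2, 9, 15), datetime(2011, 11, 2, 9, 45)),
    ("c", datetime(2011, 11, 2, 10, 0), datetime(2011, 11, 2, 10, 5)),
]

# Count overlapping partners per call - this nested loop over all pairs
# is exactly the O(n^2) work the self-join performs.
counts = {sid: sum(overlaps(s, e, s2, e2) for _, s2, e2 in calls)
          for sid, s, e in calls}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```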


    Faster

We can drastically cut down the cost by pre-selecting possible candidates. Only select the columns and rows you need. I do this with two CTEs.

  • Select calls starting on the day in question. -> CTE x
  • Calculate the latest end of those calls. (subquery in CTE y)
  • Select only calls that overlap with the total range of CTE x. -> CTE y
  • The final query is much faster than querying the huge underlying table.
WITH x AS (
   SELECT sid, starttime, endtime
   FROM   calls_nov
   WHERE  starttime >= '2011-11-02 0:0'
   AND    starttime <  '2011-11-03 0:0'
   ), y AS (
   SELECT starttime, endtime
   FROM   calls_nov
   WHERE  endtime >= '2011-11-02 0:0'
   AND    starttime <= (SELECT max(endtime) AS max_endtime FROM x)
   )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime   >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;
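The two-CTE shape can be exercised end to end on a tiny data set. This is an illustrative Python script against an in-memory SQLite database (a stand-in for PostgreSQL; the table is a made-up subset of the columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls_nov (sid TEXT, starttime TEXT, endtime TEXT);
INSERT INTO calls_nov VALUES
  ('a', '2011-11-02 09:00', '2011-11-02 09:30'),
  ('b', '2011-11-02 09:15', '2011-11-02 09:45'),
  ('c', '2011-11-02 10:00', '2011-11-02 10:05'),
  ('d', '2011-11-01 23:00', '2011-11-02 09:10');  -- starts the day before
""")

rows = conn.execute("""
WITH x AS (              -- calls starting on the day in question
   SELECT sid, starttime, endtime
   FROM   calls_nov
   WHERE  starttime >= '2011-11-02 00:00'
   AND    starttime <  '2011-11-03 00:00'
), y AS (                -- calls overlapping the total range of x
   SELECT starttime, endtime
   FROM   calls_nov
   WHERE  endtime   >= '2011-11-02 00:00'
   AND    starttime <= (SELECT max(endtime) FROM x)
)
SELECT x.sid, count(*) AS ct
FROM   x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY x.sid
ORDER  BY ct DESC, x.sid
""").fetchall()
print(rows)  # [('a', 3), ('b', 2), ('c', 1)]
```

Note how call 'd' is excluded from x (it starts the day before) but still counted in y, so 'a' correctly picks up the overlap with it.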


    Faster yet

I have a real-life table of 350,000 rows with overlapping start/end timestamps similar to yours. I used that for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (The index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

    Total runtime: 476994.774 ms

    CTE variant: Total runtime: 4199.788 ms -- that's > factor 100.

    After adding a multicolumn index of the form:

    CREATE INDEX start_end_index on calls_nov (starttime, endtime);

    Total runtime: 4159.367 ms


    Ultimate Speed

    If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and - this is the crucial point - create an index on the second one. Could look like this:

    Execute as one transaction:

BEGIN;

CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT sid, starttime, endtime
FROM   calls_nov
WHERE  starttime >= '2011-11-02 0:0'
AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM   calls_nov
WHERE  endtime >= '2011-11-02 0:0'
AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime   >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

COMMIT;
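The materialize-and-index approach can be sketched the same way, again using Python with in-memory SQLite for illustration (SQLite has no ON COMMIT DROP, so plain temp tables stand in; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls_nov (sid TEXT, starttime TEXT, endtime TEXT);
INSERT INTO calls_nov VALUES
  ('a', '2011-11-02 09:00', '2011-11-02 09:30'),
  ('b', '2011-11-02 09:15', '2011-11-02 09:45'),
  ('c', '2011-11-02 10:00', '2011-11-02 10:05');

-- materialize the candidate sets instead of using CTEs
CREATE TEMP TABLE x AS
SELECT sid, starttime, endtime
FROM   calls_nov
WHERE  starttime >= '2011-11-02 00:00'
AND    starttime <  '2011-11-03 00:00';

CREATE TEMP TABLE y AS
SELECT starttime, endtime
FROM   calls_nov
WHERE  endtime   >= '2011-11-02 00:00'
AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);  -- the crucial index
""")

rows = conn.execute("""
SELECT x.sid, count(*) AS ct
FROM   x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY x.sid
ORDER  BY ct DESC, x.sid
""").fetchall()
print(rows)  # [('a', 2), ('b', 2), ('c', 1)]
```

The index on y is what turns the inner side of the join from a sequential scan into an index range scan.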

    Read about temporary tables in the manual.


    Ultimate solution
    • Create a plpgsql function that encapsulates the magic.

    • Diagnose the typical size of your temp tables. Create them standalone and measure:

      SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));

• If they are bigger than your setting for temp_buffers, then temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to swap to disk. (The setting must take effect before the first use of temp tables in the session.)

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
  RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 + 1)::timestamp;
BEGIN

SET temp_buffers = '64MB'; -- example value; more RAM for temp tables

CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT c.sid, starttime, endtime  -- avoid naming conflict with OUT parameter
FROM   calls_nov c
WHERE  starttime >= _from
AND    starttime <  _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM   calls_nov
WHERE  endtime >= _from
AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime   >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

END;
$BODY$ LANGUAGE plpgsql;

    Call:

    SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date

    Total runtime: 138.169 ms -- that's factor 3000


    What else can you do to speed it up?

    General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully
ANALYZE calls_nov;
