Select distinct users grouped by time range

Updated: 2024-10-26 18:16:20

Problem description

I have a table with the following information.

SQL to create the table with test values:

    CREATE TABLE uniques (
        date      DATE,
        user_id   INT,
        week_beg  DATE,
        month_beg DATE
    );
    INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');

Input table:

| date       | user_id | week_beg   | month_beg  |
| 2013-01-01 | 1       | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3       | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4       | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4       | 2013-01-06 | 2013-01-01 |

Output table:

| date       | time_series | cnt |
| 2013-01-01 | D           | 1   |
| 2013-01-01 | W           | 1   |
| 2013-01-01 | M           | 1   |
| 2013-01-03 | D           | 1   |
| 2013-01-03 | W           | 2   |
| 2013-01-03 | M           | 2   |
| 2013-01-06 | D           | 1   |
| 2013-01-06 | W           | 1   |
| 2013-01-06 | M           | 3   |
| 2013-01-07 | D           | 1   |
| 2013-01-07 | W           | 1   |
| 2013-01-07 | M           | 3   |

I want to calculate the number of distinct user_id's for a date:

1. For that date
2. For that week up to that date (week to date)
3. For the month up to that date (month to date)
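To make the three definitions concrete, here is a minimal brute-force sketch in Python (not SQL) that recomputes the counts from the sample rows. The function name `to_date_counts` and the tuple layout are assumptions made for illustration; it is a reference for checking results, not a suggested implementation.

```python
from datetime import date

# Sample rows from the question: (date, user_id, week_beg, month_beg).
ROWS = [
    (date(2013, 1, 1), 1, date(2012, 12, 30), date(2013, 1, 1)),
    (date(2013, 1, 3), 3, date(2012, 12, 30), date(2013, 1, 1)),
    (date(2013, 1, 6), 4, date(2013, 1, 6), date(2013, 1, 1)),
    (date(2013, 1, 7), 4, date(2013, 1, 6), date(2013, 1, 1)),
]


def to_date_counts(rows):
    """Return (date, time_series, cnt) tuples for D, W and M, brute force."""
    # One pair of period starts per date; rows sharing a date agree on them.
    periods = {d: (wb, mb) for d, _, wb, mb in rows}
    out = []
    for d in sorted(periods):
        week_beg, month_beg = periods[d]
        # Distinct users seen on exactly this date.
        daily = {u for rd, u, _, _ in rows if rd == d}
        # Distinct users from the start of the week up to and including d.
        weekly = {u for rd, u, _, _ in rows if week_beg <= rd <= d}
        # Distinct users from the start of the month up to and including d.
        monthly = {u for rd, u, _, _ in rows if month_beg <= rd <= d}
        out.append((d, "D", len(daily)))
        out.append((d, "W", len(weekly)))
        out.append((d, "M", len(monthly)))
    return out


for d, series, cnt in to_date_counts(ROWS):
    print(d, series, cnt)
```

The printed rows follow the same date, D, W, M order as the output table.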

1 is easy to calculate. For 2 and 3 I am trying to use queries like these:

    SELECT
      date,
      'W' AS "time_series",
      (COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
    FROM user_subtitles

    SELECT
      date,
      'M' AS "time_series",
      (COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
    FROM user_subtitles

Postgres does not allow DISTINCT in window function calls, so this approach does not work.

I have also tried a GROUP BY approach, but it does not work because it gives me numbers for whole weeks/months.

What's the best way to approach this problem?

Recommended answer

Count all rows

    SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
    FROM   uniques
    GROUP  BY 1

    UNION ALL
    SELECT DISTINCT ON (1)
           date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
    FROM   uniques

    UNION ALL
    SELECT DISTINCT ON (1)
           date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
    FROM   uniques
    ORDER  BY 1, time_series

- Your columns week_beg and month_beg are 100 % redundant and can easily be replaced by date_trunc('week', date + 1)::date - 1 and date_trunc('month', date)::date respectively.
- Your week seems to start on Sunday (off by one from the Monday start that date_trunc('week', ...) uses), therefore the + 1 .. - 1.
- With ORDER BY in the OVER clause, the default frame of a window function is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
- Use UNION ALL, not UNION.
- Your unfortunate choice for time_series (D, W, M) does not sort well, so I renamed the values to make the final ORDER BY easier.
- This query can deal with multiple rows per day. Counts include all peers for a day.
- More about DISTINCT ON:
  - Select first row in each GROUP BY group?
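The point about the default frame and peers can be illustrated with a small Python emulation. This is only a sketch of the semantics of count(*) OVER (PARTITION BY ... ORDER BY ...) with the default RANGE frame; the function name `running_count_range` and the sample rows are hypothetical.

```python
from datetime import date


def running_count_range(rows):
    """Emulate count(*) OVER (PARTITION BY part ORDER BY day), default frame.

    rows: list of (part, day) tuples. Returns one count per input row.
    """
    counts = []
    for part, day in rows:
        # RANGE ... CURRENT ROW includes every peer: all rows of the same
        # partition whose ORDER BY value is <= the current row's value.
        counts.append(sum(1 for p, d in rows if p == part and d <= day))
    return counts


# Hypothetical data: two rows share 2013-01-03 inside one week partition.
rows = [
    ("2012-12-30", date(2013, 1, 1)),
    ("2012-12-30", date(2013, 1, 3)),
    ("2012-12-30", date(2013, 1, 3)),
]
print(running_count_range(rows))  # [1, 3, 3]
```

Both rows for 2013-01-03 get the same count because they are peers, which is why picking any one row per date with DISTINCT ON (1) yields a consistent value.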
To count every user only once per day, use a CTE with DISTINCT ON:
    WITH x AS (SELECT DISTINCT ON (1, 2) date, user_id FROM uniques)
    SELECT date, '1_D' AS time_series, count(user_id) AS cnt
    FROM   x
    GROUP  BY 1

    UNION ALL
    SELECT DISTINCT ON (1)
           date, '2_W'
         , count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1) ORDER BY date)
    FROM   x

    UNION ALL
    SELECT DISTINCT ON (1)
           date, '3_M'
         , count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
    FROM   x
    ORDER  BY 1, 2
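The expressions that replace week_beg and month_beg can be checked outside the database. The following Python sketch mirrors date_trunc('week', date + 1)::date - 1 and date_trunc('month', date)::date, relying on the fact that date_trunc('week', ...) truncates to Monday; the helper names `week_beg_sunday` and `month_beg` are assumptions for illustration.

```python
from datetime import date, timedelta


def week_beg_sunday(day):
    """Mirror date_trunc('week', day + 1)::date - 1 (weeks start on Sunday)."""
    shifted = day + timedelta(days=1)
    # date.weekday() is 0 for Monday, so this steps back to the Monday
    # that starts the ISO week of the shifted date.
    monday = shifted - timedelta(days=shifted.weekday())
    return monday - timedelta(days=1)


def month_beg(day):
    """Mirror date_trunc('month', day)::date."""
    return day.replace(day=1)


for d in (date(2013, 1, 1), date(2013, 1, 3), date(2013, 1, 6), date(2013, 1, 7)):
    print(d, week_beg_sunday(d), month_beg(d))
```

For the four sample dates this reproduces the week_beg and month_beg values stored in the input table.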
DISTINCT users over dynamic period of time

You can always resort to correlated subqueries. They tend to be slow with big tables! Building on the previous queries:
    WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1, 2)
       , d AS (
        SELECT date
             , (date_trunc('week', date + 1)::date - 1) AS week_beg
             , date_trunc('month', date)::date AS month_beg
        FROM   uniques
        GROUP  BY 1
        )
    SELECT date, '1_D' AS time_series, count(user_id) AS cnt
    FROM   du
    GROUP  BY 1

    UNION ALL
    SELECT date, '2_W'
         , (SELECT count(DISTINCT user_id)
            FROM   du
            WHERE  du.date BETWEEN d.week_beg AND d.date)
    FROM   d
    GROUP  BY date, week_beg

    UNION ALL
    SELECT date, '3_M'
         , (SELECT count(DISTINCT user_id)
            FROM   du
            WHERE  du.date BETWEEN d.month_beg AND d.date)
    FROM   d
    GROUP  BY date, month_beg
    ORDER  BY 1, 2;
SQL Fiddle.
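As a rough way to try the correlated-subquery idea without a PostgreSQL server, the sketch below runs a simplified variant against an in-memory SQLite database from Python. SQLite is swapped in here purely as an assumption for convenience: it has no date_trunc, so the sketch uses the stored week_beg and month_beg columns, keeps dates as ISO text, and covers only the W and M series.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL. Dates are stored as ISO text,
# which compares correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE uniques (date TEXT, user_id INT, week_beg TEXT, month_beg TEXT);
    INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
    INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');
    """
)

# One row per date in d; each subquery counts distinct users in the
# period from the period start up to and including that date.
QUERY = """
WITH d AS (SELECT DISTINCT date, week_beg, month_beg FROM uniques)
SELECT d.date, 'W' AS time_series,
       (SELECT count(DISTINCT u.user_id) FROM uniques u
        WHERE u.date BETWEEN d.week_beg AND d.date) AS cnt
FROM d
UNION ALL
SELECT d.date, 'M',
       (SELECT count(DISTINCT u.user_id) FROM uniques u
        WHERE u.date BETWEEN d.month_beg AND d.date)
FROM d
ORDER BY 1, 2
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

Because the series labels here are plain 'M' and 'W', the M row sorts before the W row for each date.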
  
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
  
Building on a simplified and sanitized data model:
- without the redundant columns
- day as column name instead of date

date is a reserved word in standard SQL and a basic type name in PostgreSQL and shouldn't be used as an identifier.
    CREATE TABLE uniques (
        day     date  -- instead of "date"
      , user_id int
    );
Improved query:
    WITH du AS (
       SELECT DISTINCT ON (1, 2)
              day, user_id
            , date_trunc('week', day + 1)::date - 1 AS week_beg
            , date_trunc('month', day)::date AS month_beg
       FROM   uniques
       )
    SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
    FROM  (
        SELECT user_id, day
             , dense_rank() OVER (PARTITION BY week_beg ORDER BY user_id) AS w
             , dense_rank() OVER (PARTITION BY month_beg ORDER BY user_id) AS m
        FROM   du
        ) s
    GROUP  BY day
    ORDER  BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you. All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).
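The dense_rank() variant rests on one identity: within a partition ordered by user_id, the highest dense rank equals the number of distinct user_id values. Here is a minimal Python sketch of that identity only; the `dense_rank` helper is a hypothetical stand-in for the window function, not a database call.

```python
def dense_rank(values):
    """Return the dense rank (1-based, no gaps) of each value, in input order."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


# user_id values in the January 2013 month partition of the sample data,
# one entry per (day, user_id) pair after daily duplicates are removed.
month_users = [1, 3, 4, 4]
ranks = dense_rank(month_users)
print(ranks)                              # [1, 2, 3, 3]
print(max(ranks), len(set(month_users)))  # 3 3
```

The per-day max() in the query then picks the highest rank among that day's users, so how closely it tracks a to-date count depends on how user_id order relates to date order in your data; in the sample data the two coincide.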