用于查找具有特定数量关联的行的 SQL 查询

编程入门行业动态更新时间:2024-10-27 05:30:21

本文介绍了用于查找具有特定数量关联的行的 SQL 查询的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

使用 Postgres，我有一个包含 conversations 和 conversationUsers 的架构.每个conversation 都有很多conversationUsers.我希望能够找到具有确切指定数量的 conversationUsers 的对话.换句话说，如果提供了一个 userIds 数组(例如，[1, 4, 6])，我希望能够找到仅包含这些用户的对话，并且不再.

到目前为止，我已经尝试过:

选择 c."conversationId"来自conversationUsers"cWHERE c."userId" IN (1, 4)GROUP BY c."conversationId"有 COUNT(c."userId") = 2;

不幸的是，这似乎也返回了包括这 2 个用户在内的对话.(例如，如果会话还包含 "userId" 5，则返回结果).

解决方案

这是一个关系分部 - 添加了特殊要求，即同一个对话不应有其他用户.

假设是表 "conversationUsers" 的 PK，它强制组合的唯一性，NOT NULL 并且还隐式提供了对性能至关重要的索引.多列PK的列按this的顺序！否则你必须做更多.关于索引列的顺序:

复合索引是否也适用于第一个字段的查询?

对于基本查询，有 "brute force" 方法来计算所有给定用户的 all 对话的匹配用户数量，然后过滤匹配的用户所有给定的用户.对于小型表格和/或只有较短的输入数组和/或每个用户的对话很少，但不能很好地扩展:

SELECT "conversationId"来自conversationUsers"cWHERE "userId" = ANY ('{1,4,6}'::int[])按 1 分组HAVING count(*) = array_length('{1,4,6}'::int[], 1)不存在(从对话用户"中选择WHERE "conversationId" = c."conversationId"AND用户ID"<>ALL('{1,4,6}'::int[]));

使用 NOT EXISTS 反半联接消除与其他用户的对话.更多:

我该怎么做(或者我可以)在多列上选择 DISTINCT 吗?

替代技术:

选择不存在的行其他表

还有其他各种(更快)关系划分查询技术.但最快的那些并不适合动态数量的用户 ID.

如何过滤SQL 会产生多通关系

对于也可以处理动态数量的用户 ID 的快速查询，请考虑使用递归 CTE:

带有递归 rcte AS (选择conversationId"，1 作为 idx来自对话用户"WHERE "userId" = ('{1,4,6}'::int[])[1]联合所有选择 c."conversationId", r.idx + 1发件人加入conversationUsers" c 使用(conversationId")WHERE c."userId" = ('{1,4,6}'::int[])[idx + 1])选择对话 ID"发件人WHERE idx = array_length(('{1,4,6}'::int[]), 1)不存在(从对话用户"中选择WHERE "conversationId" = r."conversationId"AND用户ID"<>ALL('{1,4,6}'::int[]));

为了便于使用，将其包装在一个函数或准备好的语句中.喜欢:

准备对话(int[]) AS使用递归 rcte AS (选择conversationId"，1 作为 idx来自对话用户"WHERE 用户 ID" = $1[1]联合所有选择 c."conversationId", r.idx + 1发件人加入conversationUsers" c 使用(conversationId")其中 c."userId" = $1[idx + 1])选择对话 ID"发件人WHERE idx = array_length($1, 1)不存在(从对话用户"中选择WHERE "conversationId" = r."conversationId"AND用户ID"<>全部($1)；

呼叫:

执行对话('{1,4,6}');

db<>小提琴这里(也演示一个函数)

仍有改进的余地:要获得最佳性能，您必须将会话最少的用户放在输入数组的首位，以便尽早消除尽可能多的行.要获得最佳性能，您可以动态生成非动态、非递归查询(使用第一个链接中的一种 fast 技术)并依次执行.您甚至可以使用动态 SQL 将其包装在单个 plpgsql 函数中......

更多解释:

在 WHERE 子句中多次使用同一列

替代方案:稀疏写表的 MV

如果表 "conversationUsers" 大部分是只读的(旧对话不太可能更改)，您可以使用 MATERIALIZED VIEW 在排序数组中预先聚合用户，并在该数组列上创建一个普通的 btree 索引.

CREATE MATERIALIZED VIEW mv_conversation_users ASSELECT "conversationId", array_agg("userId") AS users -- 排序数组从 (选择对话 ID"、用户 ID"来自对话用户"按 1、2 订购) 子按 1 分组按 1 排序；在 mv_conversation_users (users) 上创建索引 INCLUDE ("conversationId");

演示的覆盖索引需要 Postgres 11.请参阅:

dba.stackexchange/a/207938/3684

关于子查询中的行排序:

如何结合聚合函数应用 ORDER BY 和 LIMIT?

在旧版本中，在 (users, "conversationId") 上使用普通的多列索引.对于非常长的数组，哈希索引在 Postgres 10 或更高版本中可能有意义.

那么更快的查询就是:

SELECT "conversationId"FROM mv_conversation_users cWHERE users = '{1,4,6}'::int[];-- 排序数组！

db<>小提琴这里p>

您必须权衡增加的存储、写入和维护成本与读取性能的好处.

另外:考虑不带双引号的合法标识符.conversation_id 而不是 "conversationId" 等:

PostgreSQL 列名是否区分大小写?

Using Postgres I have a schema that has conversations and conversationUsers. Each conversation has many conversationUsers. I want to be able to find the conversation that has the exactly specified number of conversationUsers. In other words, provided an array of userIds (say, [1, 4, 6]) I want to be able to find the conversation that contains only those users, and no more.

So far I've tried this:

SELECT c."conversationId" FROM "conversationUsers" c WHERE c."userId" IN (1, 4) GROUP BY c."conversationId" HAVING COUNT(c."userId") = 2;

Unfortunately, this also seems to return conversations which include these 2 users among others. (For example, it returns a result if the conversation also includes "userId" 5).

解决方案

This is a case of relational-division - with the added special requirement that the same conversation shall have no additional users.

Assuming is the PK of table "conversationUsers" which enforces uniqueness of combinations, NOT NULL and also provides the index essential for performance implicitly. Columns of the multicolumn PK in this order! Else you have to do more. About the order of index columns:

Is a composite index also good for queries on the first field?

For the basic query, there is the "brute force" approach to count the number of matching users for all conversations of all given users and then filter the ones matching all given users. OK for small tables and/or only short input arrays and/or few conversations per user, but doesn't scale well:

SELECT "conversationId" FROM "conversationUsers" c WHERE "userId" = ANY ('{1,4,6}'::int[]) GROUP BY 1 HAVING count(*) = array_length('{1,4,6}'::int[], 1) AND NOT EXISTS ( SELECT FROM "conversationUsers" WHERE "conversationId" = c."conversationId" AND "userId" <> ALL('{1,4,6}'::int[]) );

Eliminating conversations with additional users with a NOT EXISTS anti-semi-join. More:

How do I (or can I) SELECT DISTINCT on multiple columns?

Alternative techniques:

Select rows which are not present in other table

There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.

How to filter SQL results in a has-many-through relation

For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:

WITH RECURSIVE rcte AS ( SELECT "conversationId", 1 AS idx FROM "conversationUsers" WHERE "userId" = ('{1,4,6}'::int[])[1] UNION ALL SELECT c."conversationId", r.idx + 1 FROM rcte r JOIN "conversationUsers" c USING ("conversationId") WHERE c."userId" = ('{1,4,6}'::int[])[idx + 1] ) SELECT "conversationId" FROM rcte r WHERE idx = array_length(('{1,4,6}'::int[]), 1) AND NOT EXISTS ( SELECT FROM "conversationUsers" WHERE "conversationId" = r."conversationId" AND "userId" <> ALL('{1,4,6}'::int[]) );

For ease of use wrap this in a function or prepared statement. Like:

PREPARE conversations(int[]) AS WITH RECURSIVE rcte AS ( SELECT "conversationId", 1 AS idx FROM "conversationUsers" WHERE "userId" = $1[1] UNION ALL SELECT c."conversationId", r.idx + 1 FROM rcte r JOIN "conversationUsers" c USING ("conversationId") WHERE c."userId" = $1[idx + 1] ) SELECT "conversationId" FROM rcte r WHERE idx = array_length($1, 1) AND NOT EXISTS ( SELECT FROM "conversationUsers" WHERE "conversationId" = r."conversationId" AND "userId" <> ALL($1);

Call:

EXECUTE conversations('{1,4,6}');

db<>fiddle here (also demonstrating a function)

There is still room for improvement: to get top performance you have to put users with the fewest conversations first in your input array to eliminate as many rows as possible early. To get top performance you can generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...

More explanation:

Using same column multiple times in WHERE clause

Alternative: MV for sparsely written table

If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change) you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.

CREATE MATERIALIZED VIEW mv_conversation_users AS SELECT "conversationId", array_agg("userId") AS users -- sorted array FROM ( SELECT "conversationId", "userId" FROM "conversationUsers" ORDER BY 1, 2 ) sub GROUP BY 1 ORDER BY 1; CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");

The demonstrated covering index requires Postgres 11. See:

dba.stackexchange/a/207938/3684

About sorting rows in a subquery:

How to apply ORDER BY and LIMIT in combination with an aggregate function?

In older versions use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.

Then the much faster query would simply be:

SELECT "conversationId" FROM mv_conversation_users c WHERE users = '{1,4,6}'::int[]; -- sorted array!

db<>fiddle here

You have to weigh added costs to storage, writes and maintenance against benefits to read performance.

Aside: consider legal identifiers without double quotes. conversation_id instead of "conversationId" etc.: