Postgres multi-column index (integer, boolean, and array)


I have a Postgres 9.4 database with a table like this:

| id | other_id | current | dn_ids                                 | rank |
|----|----------|---------|----------------------------------------|------|
| 1  | 5        | F       | {123,234,345,456,111,222,333,444,555}  | 1    |
| 2  | 7        | F       | {123,100,200,900,800,700,600,400,323}  | 2    |

(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:

CREATE TABLE mytable (
    id integer NOT NULL,
    other_id integer,
    rank integer,
    current boolean DEFAULT false,
    dn_ids integer[] DEFAULT '{}'::integer[]
);

CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;

ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);

CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
 

I need to optimize queries like this:

SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
 

This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.

Here's an example EXPLAIN:

Limit  (cost=36944.53..36945.78 rows=500 width=65)
  ->  Sort  (cost=36942.03..37007.42 rows=26156 width=65)
        Sort Key: rank
        ->  Seq Scan on mytable  (cost=0.00..35431.42 rows=26156 width=65)
              Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))

Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.

What's the right type of compound index to use for this query? I don't care about space at all.

Does it matter much how I order the terms in the WHERE clause?

Best answer

What's the right type of compound index to use for this query? I don't care about space at all.

This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:

Difference between GiST and GIN index

You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).

Multicolumn index on 3 fields with heterogeneous data types
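As a sketch of what that combined index could look like (the index name and column order are illustrative, not from the answer):

```sql
-- btree_gin lets a plain integer column sit alongside an integer[]
-- in a single GIN index.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical combined index on the question's table:
CREATE INDEX ind_other_id_dn_ids ON mytable USING gin (other_id, dn_ids);
```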

However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.

And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
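Whether the planner actually combines the existing single-column indexes this way can be checked with EXPLAIN; a minimal sketch (the resulting plan depends entirely on your data and statistics):

```sql
-- For a query with both predicates, the planner *may* produce a
-- BitmapAnd over a bitmap index scan on index_mytable_on_other_id
-- (btree) and one on ind_dn_ids (GIN).
EXPLAIN
SELECT id
FROM mytable
WHERE other_id = 5
  AND dn_ids && ARRAY[123];
```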

Finally, while indexes can be used for sorted output, this will probably not pay off for a query like the one you show (unless you select large parts of the table). Not applicable to the updated question.

Partial indexes might be an option. Other than that, you already have all the indexes you need.

I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
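In DDL terms that suggestion is simply:

```sql
-- Drop the low-selectivity boolean index the answer calls pointless.
DROP INDEX index_mytable_on_current;
```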

Does it matter much how I order the terms in the WHERE clause?

The order of WHERE conditions is completely irrelevant.

Addendum after question update

The utility of indexes is bound to selective criteria. If more than roughly 5 % of the table is selected (the threshold depends on various factors), a sequential scan of the whole table is typically faster than dealing with the overhead of any index - except for pre-sorting the output, which is the one thing an index is still good for in such cases.

For a query that fetches 25,000 of 250,000 rows, indexes are mostly useful for that pre-sorting - which gets all the more interesting when you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.

Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
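One common workaround for deep OFFSETs - not part of the original answer, offered here only as a hedged sketch - is keyset pagination, which continues after the last row seen instead of skipping rows. It assumes the sort key is made unique with a tiebreaker such as id; the literals 1000 and 42 are placeholder values from a hypothetical previous page:

```sql
-- Next page without OFFSET: resume after the last (rank, id) pair
-- returned by the previous page.
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5
  AND NOT current
  AND NOT (ARRAY[100,200] && dn_ids)
  AND (rank, id) > (1000, 42)   -- placeholders: last row of previous page
ORDER BY rank, id
LIMIT 500;
```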

Even with your added information, much of what's relevant is still in the dark. I am going to assume that:

- Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows, unless you have very few distinct elements in dn_ids.
- The most selective predicate is other_id = 5.
- A substantial part of the rows is eliminated with NOT current. (Aside: current = F isn't valid syntax in standard Postgres. It must be NOT current or current = FALSE.)

While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:

CREATE INDEX foo ON mytable (other_id, rank, dn_ids) WHERE NOT current;

The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:

CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
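For either partial index to be considered at all, the query's WHERE clause has to imply the index predicate, so it should spell out NOT current rather than the invalid current = F:

```sql
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5
  AND NOT current                        -- matches the partial-index predicate
  AND NOT (ARRAY[100,200] && dn_ids)     -- applied as a filter afterwards
ORDER BY rank
LIMIT 500 OFFSET 1000;
```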

GiST indexes may become more interesting in Postgres 9.5 due to this new feature:

Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

Aside: current is a reserved word in standard SQL, even though it's allowed as an identifier in Postgres. Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate would do nothing by itself.

Auto increment SQL function
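A sketch of what that aside points at: declaring id as serial creates the sequence, sets the column default, and marks the sequence as owned by the column in one step (the column list is copied from the question):

```sql
CREATE TABLE mytable (
    id       serial PRIMARY KEY,
    other_id integer,
    rank     integer,
    current  boolean DEFAULT false,
    dn_ids   integer[] DEFAULT '{}'::integer[]
);

-- Alternatively, to repair the hand-rolled setup from the question:
-- ALTER SEQUENCE mytable_id_seq OWNED BY mytable.id;
```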

Published: 2023-08-04 18:46:00
Original page: https://www.elefans.com/category/jswz/34/1420888.html