How to do a large non-blocking UPDATE in PostgreSQL?

Updated: 2024-10-25 22:35:20

Problem Description

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:

UPDATE orders SET status = null;

To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.

The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like

UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);

might take 1 minute. Over 35 million rows, doing the above and breaking it into chunks of 35 would only take 35 minutes and save me 4 hours and 25 minutes.

I could break it down even further with a script (using pseudocode here):

for (i = 0 to 3500) {
    db_operation("UPDATE orders SET status = null" +
                 " WHERE order_id >= " + (i * 10000) +
                 " AND order_id < " + ((i + 1) * 10000));
}

This operation might complete in only a few minutes, rather than 35.

So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?

Recommended Answer

Column / row

... I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update.

Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.

To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.

When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
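A quick way to see this for yourself (a hypothetical check using the table from the question; the exact plan output depends on your version, statistics, and data) is to ask the planner:

```sql
-- The planner typically picks a sequential scan for a whole-table UPDATE,
-- since every row must be visited and rewritten anyway.
EXPLAIN
UPDATE orders
SET    status = null;
```

A plan along the lines of Update on orders -> Seq Scan on orders confirms that no index is used for the scan.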

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this: UPDATE orders SET status = null;

I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:

ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;

Per documentation:

When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change...

And:

The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated. (These statements do not apply when dropping the system oid column; that is done with an immediate rewrite.)

Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. Requires an exclusive lock on the table which may be a problem for heavy concurrent load. Since it only takes a few milliseconds, you should still be fine.

If you have a column default you want to keep, add it back in a separate command. Doing it in the same command would apply it to all rows immediately, voiding the effect. You could then update the existing columns in batches. Follow the documentation link and read the Notes in the manual.
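As a sketch, assuming status previously had a DEFAULT you want to restore (the value 'new' is made up for illustration):

```sql
-- Both changes are catalog-only, since ADD COLUMN carries no DEFAULT clause:
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;

-- Add the default back in a *separate* command: it then applies only to
-- future rows; the existing rows keep their NULL, so this stays instant.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';
```

Had the DEFAULT been part of the ADD COLUMN itself, every existing row would have been initialized with 'new' immediately, voiding the effect.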

dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.

This allows running a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very large numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.

If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
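For that single-transaction case, a SAVEPOINT sketch (the chunk boundaries here are illustrative) lets finished chunks survive a later failure, though all locks are still held until the final COMMIT:

```sql
BEGIN;
UPDATE orders SET status = null WHERE order_id >= 0       AND order_id < 1000000;
SAVEPOINT chunk_1;
UPDATE orders SET status = null WHERE order_id >= 1000000 AND order_id < 2000000;
-- If this second chunk errors out:
--   ROLLBACK TO SAVEPOINT chunk_1;   -- the first chunk's work is kept
COMMIT;
```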

First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.

If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.

The additional module dblink needs to be installed first:

  • How to use (install) dblink in PostgreSQL?
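On PostgreSQL 9.1 or later this is a one-liner per database (run as a superuser; on older versions you would instead run the dblink.sql script shipped with contrib):

```sql
CREATE EXTENSION IF NOT EXISTS dblink;
```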

Setting up the connection with dblink very much depends on the setup of your DB cluster and security policies in place. It can be tricky. Related later answer with more how to connect with dblink:

  • Persistent inserts in a UDF even if the function aborts

Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already). Assuming a serial PRIMARY KEY with or without some gaps.
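A minimal sketch of that setup, assuming a loopback connection to the same cluster (all connection options and the role/password are placeholders; the name 'myserver' matches what the function below connects to):

```sql
-- The dblink extension ships the FDW handler 'dblink_fdw':
CREATE SERVER myserver FOREIGN DATA WRAPPER dblink_fdw
OPTIONS (host 'localhost', port '5432', dbname 'mydb');

-- Credentials used when the current role connects through 'myserver':
CREATE USER MAPPING FOR CURRENT_USER SERVER myserver
OPTIONS (user 'myrole', password 'mypass');
```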

CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;  -- size of step
   _cur  int;  -- current ID (starting with minimum)
   _max  int;  -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                         -- 100 slices (steps) hard coded
   _step := ((_max - _cur) / 100) + 1;   -- rounded, possibly a bit too small
                                         -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');   -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                  -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
         $$UPDATE public.orders
           SET    status = 'foo'
           WHERE  order_id >= $$ || _cur || $$
           AND    order_id <  $$ || _cur + _step || $$
           AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

      _cur := _cur + _step;
      EXIT WHEN _cur > _max;             -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$  LANGUAGE plpgsql;

Call:

SELECT f_update_in_steps();

You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:

  • Table name as a PostgreSQL function parameter

About avoiding empty UPDATE:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

Published: 2023-07-09 02:27:27
Link: https://www.elefans.com/category/jswz/34/1082899.html