Following on from this question, when I try to create a postgresql table from a dask.dataframe with more than one partition I get the following error:
IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(test1, 2200) already exists.
[SQL: '
CREATE TABLE test1 (
	"A" BIGINT,
	"B" BIGINT,
	"C" BIGINT,
	"D" BIGINT,
	"E" BIGINT,
	"F" BIGINT,
	"G" BIGINT,
	"H" BIGINT,
	"I" BIGINT,
	"J" BIGINT,
	idx BIGINT
)
']
You can recreate the error with the following code:
import numpy as np
import dask.dataframe as dd
import dask
import pandas as pd
import sqlalchemy_utils as sqla_utils
import sqlalchemy as sqla

DATABASE_CONFIG = {
    'driver': '',
    'host': '',
    'user': '',
    'password': '',
    'port': 5432,
}
DBNAME = 'dask'
url = '{driver}://{user}:{password}@{host}:{port}/'.format(**DATABASE_CONFIG)
db_url = url.rstrip('/') + '/' + DBNAME

# create db if non-existent
if not sqla_utils.database_exists(db_url):
    print('Creating database \'{}\''.format(DBNAME))
    sqla_utils.create_database(db_url)
conn = sqla.create_engine(db_url)

# create pandas df with random numbers
df = pd.DataFrame(np.random.randint(0, 40, size=(100, 10)), columns=list('ABCDEFGHIJ'))
# add index so that it can be used as primary key later on
df['idx'] = df.index
# create dask df
ddf = dd.from_pandas(df, npartitions=4)

# Write to psql
dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'test', db_url, if_exists='append', index=False, index_label='idx')
       for d in ddf.to_delayed()]
dask.compute(*out)
The code doesn't produce an error if npartitions is set to 1. So I'm guessing it has to do with Postgres not being able to handle parallel requests to write to the same SQL table...? How can I fix this?
Answer:
I was reading this. It seems this error arises when the same table is created/updated by parallel processes. As I understand it, that is the cause here (as explained in the Google Groups discussion).
So I think it depends on PostgreSQL itself, not on the connection driver or the module used for multiprocessing.
Actually, the only way I found to solve this was to make the chunks big enough that the writing process is slower than the calculation itself. With bigger chunks, this error doesn't arise.