Question
I am trying to convert a PostgreSQL DB to a DataFrame. Following is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"

connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
The error I am getting:
java.sql.SQLException: Unsupported type ARRAY on generating df2 for table name
If I hard-code the table name value, I do not get the same error:
df2 = spark.read.jdbc(jdbcUrl,"conditions",properties=connectionProperties)
I checked the table_name type and it is String. Is this the correct approach?
Solution

I guess you don't want the table names that belong to the internal workings of Postgres, such as pg_type, pg_policies, etc., whose schema is pg_catalog. Reading those causes the error
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : java.sql.SQLException: Unsupported type ARRAY
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='pg_type', properties=connectionProperties)
And there are tables such as applicable_roles, view_table_usage, etc., whose schema is information_schema, which causes
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : org.postgresql.util.PSQLException: ERROR: relation "view_table_usage" does not exist
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='view_table_usage', properties=connectionProperties)
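A side note not in the original answer: the relation "..." does not exist error happens because Spark hands the bare table name to Postgres, which resolves it against the default search_path (normally just public). If you did want to read such a table, schema-qualifying the name should work. A minimal sketch, assuming a hypothetical qualify helper:

```python
def qualify(schema: str, table: str) -> str:
    """Build a schema-qualified table name that Postgres can resolve
    even when the schema is not on its search_path."""
    return f"{schema}.{table}"

# Pass the qualified name as the `table` argument to spark.read.jdbc
# instead of the bare name, e.g.:
#   qualify("information_schema", "view_table_usage")
```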
Tables whose schema is public can be read into DataFrames using the above jdbc command.
I checked the table_name type and it is String. Is this the correct approach?
So you need to filter out those table names and apply your logic as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://hostname:port/"

connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

query = "information_schema.tables"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.filter(
    (df["table_schema"] != 'pg_catalog') &
    (df["table_schema"] != 'information_schema')
).select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
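An alternative sketch, and my own assumption rather than part of the original answer: instead of filtering the DataFrame in Spark, the schema filter can be pushed down to Postgres by embedding it in the subquery passed as the table argument, so only user tables are ever returned. The public_tables_query helper is hypothetical:

```python
def public_tables_query(excluded=("pg_catalog", "information_schema")):
    """Build a JDBC subquery listing table names outside the given schemas.
    The resulting string can be passed as the `table` argument to
    spark.read.jdbc, letting Postgres do the filtering."""
    not_in = ", ".join(f"'{s}'" for s in excluded)
    return (
        "(SELECT table_name FROM information_schema.tables "
        f"WHERE table_schema NOT IN ({not_in})) AS user_tables"
    )
```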
That should work.