Question
I am trying to convert a PostgreSQL DB to a DataFrame. Following is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"

connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
The error I am getting:
java.sql.SQLException: Unsupported type ARRAY on generating df2 for table name
If I hard-code the table name value, I do not get the same error:
df2 = spark.read.jdbc(jdbcUrl,"conditions",properties=connectionProperties)
I checked the table_name type and it is String. Is this the correct approach?
Solution

I guess you don't want the table names that belong to the internal workings of Postgres, such as pg_type, pg_policies, etc., whose schema is pg_catalog. Reading those causes the error
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : java.sql.SQLException: Unsupported type ARRAY
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='pg_type', properties=connectionProperties)
And there are tables such as applicable_roles, view_table_usage, etc., whose schema is information_schema, which causes
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : org.postgresql.util.PSQLException: ERROR: relation "view_table_usage" does not exist
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='view_table_usage', properties=connectionProperties)
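A side note not in the original answer: the relation "..." does not exist error happens because Spark hands the bare table name to Postgres, which resolves it against the default search_path (normally just public). If you did want to read such a table, schema-qualifying the name should work. A minimal sketch, assuming a hypothetical qualify helper:

```python
def qualify(schema: str, table: str) -> str:
    """Build a schema-qualified table name that Postgres can resolve
    even when the schema is not on its search_path."""
    return f"{schema}.{table}"

# Pass the qualified name as the `table` argument to spark.read.jdbc
# instead of the bare name, e.g.:
#   qualify("information_schema", "view_table_usage")
```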
Tables whose schema is public can be read into DataFrames using the above jdbc command.
I checked the table_name type and it is String. Is this the correct approach?
So you need to filter out those table names and apply your logic as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://hostname:port/"

connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

query = "information_schema.tables"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.filter(
    (df["table_schema"] != 'pg_catalog') &
    (df["table_schema"] != 'information_schema')
).select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
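An alternative sketch, and my own assumption rather than part of the original answer: instead of filtering the DataFrame in Spark, the schema filter can be pushed down to Postgres by embedding it in the subquery passed as the table argument, so only user tables are ever returned. The public_tables_query helper is hypothetical:

```python
def public_tables_query(excluded=("pg_catalog", "information_schema")):
    """Build a JDBC subquery listing table names outside the given schemas.
    The resulting string can be passed as the `table` argument to
    spark.read.jdbc, letting Postgres do the filtering."""
    not_in = ", ".join(f"'{s}'" for s in excluded)
    return (
        "(SELECT table_name FROM information_schema.tables "
        f"WHERE table_schema NOT IN ({not_in})) AS user_tables"
    )
```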
That should work.