Using a PostgreSQL JDBC source with Apache Spark on EMR

Problem description

I have an existing EMR cluster running and wish to create a DF from a Postgresql DB source.

To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR already downloaded on the master and slave nodes, or you can add these as arguments to a spark-submit job.

Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?

I tried the following:

  • Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL jar (postgresql-9.4.1207.jre6.jar) into it.

    Edited spark-defaults.conf to include the wildcard location:

    spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$

  • Tried to create a dataframe in a Jupyter cell using the following code:

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver": 'com.postgresql.jdbc.Driver'})

  • I get a Java error as per below:

    Py4JJavaError: An error occurred while calling o396.jdbc. : java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver

    Help appreciated.

    Recommended answer

    I think you don't need to copy the postgres jar to the slaves, as the driver program and cluster manager take care of everything. I've created a dataframe from a Postgres external source in the following way:

    Download the postgres driver jar:

    cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar

    Create the dataframe:

    attribute = {'url': 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
                 'database': <db>,
                 'dbtable': <select * from table>}
    df = spark.read.format('jdbc').options(**attribute).load()
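As a concrete sketch of that options dict, with the placeholders filled in (the host, credentials, and table below are hypothetical), the URL is assembled like this. Note that the correct driver class is org.postgresql.Driver, not the com.postgresql.jdbc.Driver used in the failing cell above:

```python
# Hypothetical connection values; replace with your own.
host, port, db, user, password = "mydb.example.com", 5432, "mydb", "appuser", "secret"

# Build the JDBC options dict. The driver class for the PostgreSQL JDBC jar
# is org.postgresql.Driver (the question's com.postgresql.jdbc.Driver is what
# triggered the ClassNotFoundException).
attribute = {
    "url": "jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}".format(
        host=host, port=port, db=db, user=user, password=password),
    "driver": "org.postgresql.Driver",
    # A subquery passed as dbtable must be parenthesized and aliased.
    "dbtable": "(select * from some_table) AS t",
}

print(attribute["url"])

# With a live SparkSession this would then be:
# df = spark.read.format("jdbc").options(**attribute).load()
```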

    Submit the Spark job: add the downloaded jar to the driver class path when submitting the job.

    --properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
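When launching through spark-submit directly rather than a --properties flag, the same jar can be passed with spark-submit's own options; a sketch with illustrative paths:

```
# Sketch: pass the downloaded driver jar to spark-submit
# (jar path and script name are illustrative).
spark-submit \
  --driver-class-path "$HOME/postgresql-42.2.5.jar" \
  --jars "$HOME/postgresql-42.2.5.jar" \
  my_job.py
```

--driver-class-path puts the jar on the driver's classpath, while --jars ships it to the executors, so no cluster relaunch or spark-defaults.conf edit is needed.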
