How can I add any new library such as spark-sftp to my Pyspark code?

This article covers the question "How can I add any new library such as spark-sftp to my Pyspark code?" together with its recommended answer. It should be a useful reference for anyone hitting the same problem.

Problem Description


When I try to set the package dependency "spark-sftp" in my Spark conf, I get a ClassNotFoundException. But it works when I execute the script using:

spark-submit --packages com.springml:spark-sftp_2.11:1.1.1 test.py

Below is my code. Can someone tell me how I can execute my pyspark script without passing the package as an argument to spark-submit?

import pyspark
from pyspark.sql import SparkSession

# Create a new config, including the spark-sftp package dependency
conf = (pyspark.conf.SparkConf()
        .set("spark.driver.maxResultSize", "16g")
        .set("spark.driver.memory", "20g")
        .set("spark.executor.memory", "20g")
        .set("spark.executor.cores", "5")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.initialExecutors", "24")
        .set("spark.dynamicAllocation.minExecutors", "6")
        .set("spark.submit.deployMode", "client")
        .set("spark.jars.packages", "com.springml:spark-sftp_2.11:1.1.1")
        .set("spark.python.worker.memory", "4g")
        .set("spark.default.parallelism", "960")
        .set("spark.executor.memoryOverhead", "4g")
        .setMaster("yarn-client"))

# Create a new session with the config above
spark = (SparkSession.builder
         .appName("AppName")
         .config(conf=conf)
         .enableHiveSupport()
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")

# Read a CSV file over SFTP via the spark-sftp data source
df = (spark.read.format("com.springml.spark.sftp")
      .option("host", "HOST")
      .option("username", "HOSTNAME")
      .option("password", "pass")
      .option("fileType", "csv")
      .option("inferSchema", "true")
      .load("/test/sample.csv"))

Output:

: java.lang.ClassNotFoundException: Failed to find data source: com.springml.spark.sftp. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.springml.spark.sftp.DefaultSource

Recommended Answer

When submitting a Spark job, you can specify which packages to install. For this one, you can specify the maven dependency as:

> $SPARK_HOME/bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.3
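
If you want to avoid passing the package on the command line entirely, as the question asks, one commonly used workaround is to set the PYSPARK_SUBMIT_ARGS environment variable from inside the script, before PySpark starts the JVM. The sketch below is not from the original answer; it assumes the script is launched with plain python (not spark-submit) and reuses the same Maven coordinates:

import os

# Workaround sketch (an assumption, not from the original answer):
# PYSPARK_SUBMIT_ARGS must be set before the first SparkSession is
# created, and PySpark's launcher requires the trailing "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.springml:spark-sftp_2.11:1.1.1 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AppName").getOrCreate()

Either way, the package has to be resolved before the JVM starts, which is likely why setting spark.jars.packages on a SparkConf from an already-running Python process did not take effect here.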

This concludes the article on adding a new library such as spark-sftp to Pyspark code. Hopefully the recommended answer above helps.
