Cannot submit Spark job from Windows IDE to Linux cluster

Problem description

I just read about findspark and found it quite interesting, as so far I have only used spark-submit, which isn't well suited for interactive development in an IDE. I tried executing this file on Windows 10, Anaconda 4.4.0, Python 3.6.1, IPython 5.3.0, Spyder 3.1.4, Spark 2.1.1:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='local', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

Spyder generates the command runfile('C:/tests/temp1.py', wdir='C:/tests') and it prints out [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as expected. However, if I try to use a Spark cluster running on Ubuntu, I get an error:

def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

IPython errors:

Traceback (most recent call last):
  File "<ipython-input-1-820bd4275b8c>", line 1, in <module>
    runfile('C:/tests/temp.py', wdir='C:/tests')
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/tests/temp.py", line 11, in <module>
    print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\pyspark\rdd.py", line 808, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling

Worker stderr:

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "C:\Anaconda3\pythonw.exe": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

For some reason this is trying to use a Windows binary path on the Linux slave. Any ideas how to overcome this? I get the same outcome with the Python console in Spyder, except the error is Cannot run program "C:\Anaconda3\python.exe": error=2, No such file or directory. It actually happens from the command line as well, when running python temp.py.

This version works fine even when submitted from Windows to Linux:

def inc(i):
    return i + 1

import pyspark
sc = pyspark.SparkContext(appName='test2')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))

spark-submit --master spark://192.168.1.57:7077 temp2.py

Recommended answer

I found the solution, which turned out to be very simple. pyspark/context.py uses the environment variable PYSPARK_PYTHON to determine the path of the Python executable, but defaults to the "correct" plain python. However, by default findspark overrides this environment variable to match sys.executable, which clearly won't work cross-platform.
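The override is easy to observe. Below is a minimal diagnostic sketch; the path it would print is an assumption based on the traceback above (Spyder runs the script under pythonw.exe):

import os
import sys

import findspark
findspark.init()

# findspark exports PYSPARK_PYTHON based on the local interpreter (sys.executable),
# and that path is what the executors later try to launch as their Python worker.
print(sys.executable)                    # e.g. C:\Anaconda3\pythonw.exe under Spyder
print(os.environ.get('PYSPARK_PYTHON'))  # the same Windows path, handed to the Linux executors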

Anyway, here is the working code for future reference:

def inc(i):
    return i + 1

import findspark
findspark.init(python_path='python')  # <-- so simple!

import pyspark
sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
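If you would rather not rely on findspark's python_path argument, an equivalent workaround is to set PYSPARK_PYTHON yourself after findspark.init() and before the SparkContext is created. This is a sketch, assuming plain python is on the PATH of every Linux worker:

import os

import findspark
findspark.init()

# Overwrite whatever interpreter path findspark exported, so that each Linux worker
# resolves 'python' on its own PATH instead of being given a Windows path.
os.environ['PYSPARK_PYTHON'] = 'python'

import pyspark

def inc(i):
    return i + 1

sc = pyspark.SparkContext(master='spark://192.168.1.57:7077', appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))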
