How do I submit a Spark job to a remote master node in yarn-client mode?

Problem description

I need to submit spark apps/jobs onto a remote spark cluster. I have currently spark on my machine and the IP address of the master node as yarn-client. Btw my machine is not in the cluster. I submit my job with this command

./spark-submit --class SparkTest --deploy-mode client /home/vm/app.jar

I have the address of my master hardcoded into my app in the form

val spark_master = "spark://IP:7077"

However, all I get is this error:

16/06/06 03:04:34 INFO AppClient$ClientEndpoint: Connecting to master spark://IP:7077...
16/06/06 03:04:34 WARN AppClient$ClientEndpoint: Failed to connect to master IP:7077
java.io.IOException: Failed to connect to /IP:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /IP:7077

Or if I use

./spark-submit --class SparkTest --master yarn --deploy-mode client /home/vm/test.jar

I get

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:251)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:228)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do I really need to have hadoop configured as well in my workstation? All the work will be done remotely and this machine is not part of the cluster. I am using Spark 1.6.1.

Recommended answer

First of all, if you are setting conf.setMaster(...) in your application code, it takes the highest precedence (over the --master argument). If you want to run in yarn-client mode, do not use MASTER_IP:7077 in the application code. Instead, supply the Hadoop client configuration files to your driver in the following way.
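As a minimal sketch, the driver setup might look like this once the hardcoded master is removed (Spark 1.6-era API; the app name reuses the SparkTest class from the question):

import org.apache.spark.{SparkConf, SparkContext}

// No setMaster(...) here: the master comes from spark-submit's --master
// flag (or spark.master in spark-defaults.conf), not from application code.
val conf = new SparkConf().setAppName("SparkTest")
val sc = new SparkContext(conf)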

You should set the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR to point to the directory that contains the client configurations.

https://spark.apache.org/docs/latest/running-on-yarn.html
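For example, if you have copied the cluster's client configs to your workstation, the submission could look like this (the /home/vm/hadoop-conf path is an assumption; use whatever directory holds your core-site.xml, yarn-site.xml, and friends):

export HADOOP_CONF_DIR=/home/vm/hadoop-conf   # assumed location of the cluster's client config files
./spark-submit --class SparkTest --master yarn --deploy-mode client /home/vm/test.jar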

Depending on which Hadoop features you are using in your Spark application, some of the config files will be used to look up configuration. If you are using Hive (through HiveContext in spark-sql), it will look for hive-site.xml. hdfs-site.xml will be used to look up the NameNode coordinates for reading from and writing to HDFS from your job.
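To illustrate, once HADOOP_CONF_DIR points at those files, the job can address HDFS without naming the NameNode explicitly (the input path below is hypothetical):

// With hdfs-site.xml/core-site.xml picked up via HADOOP_CONF_DIR, a
// schemeless path resolves against the cluster's default NameNode.
val lines = sc.textFile("/user/vm/input.txt")  // hypothetical HDFS path
println(lines.count())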
