I want to use Airflow to orchestrate jobs that include Pig scripts, shell scripts, and Spark jobs.
For the Spark jobs in particular, I want to use Apache Livy, but I'm not sure whether that is a good idea or whether I should simply run spark-submit.
What is the best way to track a Spark job from Airflow once it has been submitted?
Recommended answer
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
- Specifying the remote master IP: requires modifying global configurations / environment variables
- Using SSHOperator: the SSH connection might break
- Using EmrAddStepsOperator: dependent on EMR
Regarding tracking
- Livy only reports state, not progress (% completion of stages)
- If you're OK with that, you can just poll the Livy server via its REST API and keep printing the logs to the console; they will appear in the task logs in the Airflow WebUI (View Logs)
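The polling approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the helper names `http_json` and `run_batch`, the Livy URL, the JAR path, and the class name are all hypothetical, and the set of terminal states is an assumption based on the states Livy documents. It relies on Livy's `POST /batches`, `GET /batches/{id}/state`, and `GET /batches/{id}/log` endpoints.

```python
import json
import time
import urllib.request

# Assumption: batch states after which the job will not change any further.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def http_json(method, url, payload=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_batch(livy_url, jar_path, class_name, poll_seconds=10, call=http_json):
    """Submit a batch to Livy, print its log while polling, return the final state."""
    batch = call(
        "POST",
        f"{livy_url}/batches",
        {"file": jar_path, "className": class_name},
    )
    batch_id = batch["id"]
    printed = 0  # number of log lines already printed
    while True:
        state = call("GET", f"{livy_url}/batches/{batch_id}/state")["state"]
        log = call(
            "GET",
            f"{livy_url}/batches/{batch_id}/log?from={printed}&size=100",
        )
        lines = log.get("log", [])
        for line in lines:
            # Anything printed here ends up in the Airflow task log.
            print(line)
        printed += len(lines)
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

Called from the callable of a PythonOperator, raising an exception when the returned state is not `"success"` makes Airflow mark the task as failed. Note that this sketch fetches at most 100 log lines per poll, so the tail of a long log may not be printed once a terminal state is reached.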
Other considerations
- Livy doesn't support reusing a SparkSession across POST /batches requests
- If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
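To illustrate the session-based alternative, here is a hedged sketch of creating one interactive PySpark session and running statements in it, so that every statement shares the same SparkSession. The helper names (`http_json`, `start_session`, `run_statement`) and the Livy URL are hypothetical; the endpoints used are Livy's `POST /sessions`, `GET /sessions/{id}`, `POST /sessions/{id}/statements`, and `GET /sessions/{id}/statements/{statementId}`. The sets of failure states are assumptions.

```python
import json
import time
import urllib.request


def http_json(method, url, payload=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def start_session(livy_url, poll_seconds=5, call=http_json):
    """Create a PySpark session and wait until it is idle (ready for code)."""
    session_id = call("POST", f"{livy_url}/sessions", {"kind": "pyspark"})["id"]
    while True:
        state = call("GET", f"{livy_url}/sessions/{session_id}")["state"]
        if state == "idle":
            return session_id
        # Assumption: these states mean the session will never become usable.
        if state in {"error", "dead", "killed", "shutting_down"}:
            raise RuntimeError(f"Livy session {session_id} ended in state {state}")
        time.sleep(poll_seconds)


def run_statement(livy_url, session_id, code, poll_seconds=2, call=http_json):
    """Run one code snippet in an existing session and return its output."""
    base = f"{livy_url}/sessions/{session_id}/statements"
    statement_id = call("POST", base, {"code": code})["id"]
    while True:
        statement = call("GET", f"{base}/{statement_id}")
        if statement["state"] == "available":
            return statement["output"]
        if statement["state"] in {"error", "cancelled"}:
            raise RuntimeError(f"Statement {statement_id} ended in state {statement['state']}")
        time.sleep(poll_seconds)
```

Calling `run_statement` several times with the same `session_id` reuses the session, which is the behavior that batch submissions do not offer.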
References
- How to submit Spark jobs to EMR cluster from Airflow?
- livy/examples/pi_app
- rssanders3/livy_spark_operator_python_example
Useful links
- How to submit Spark jobs to EMR cluster from Airflow?
- Remote spark-submit to YARN running on EMR