How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same SG, VPC and Subnet.
I need a solution so that Airflow can talk to EMR and execute Spark submits.
https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/
This blog covers execution after the connection has been established (it didn't help much).
In Airflow I have created connections for AWS and EMR using the UI:
Below is the code that lists the EMR clusters which are Active and Terminated; I can also fine-tune it to get only Active clusters:
from airflow.contrib.hooks.aws_hook import AwsHook
import boto3

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')
clusters = client.list_clusters()['Clusters']
for x in clusters:
    print(x['Status']['State'], x['Name'])

My question is: how can I update my code above to perform Spark-submit actions?
Solution
While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow.
Use Apache Livy
- This solution is actually independent of remote server, i.e., EMR
- Here's an example
- The downside is that Livy is in early stages and its API appears incomplete and wonky to me
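A minimal sketch of the Livy route, assuming Livy is reachable on the EMR master's default port 8998; the bucket path and job arguments below are hypothetical. A POST to Livy's /batches endpoint runs a spark-submit on the cluster for you:

```python
import json

def build_livy_batch(file, class_name=None, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches endpoint,
    which wraps a spark-submit on the remote cluster."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload

# Hypothetical job: a PySpark script on S3 with one argument.
payload = build_livy_batch(
    "s3://my-bucket/jobs/etl.py",
    args=["--date", "2021-01-01"],
    conf={"spark.executor.memory": "2g"},
)
print(json.dumps(payload))
```

From an Airflow task you would then POST this payload to `http://<emr-master-dns>:8998/batches` (e.g. with `requests`) and poll `GET /batches/<id>/state` until the job finishes.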
Use EmrSteps API
- Dependent on remote system: EMR
- Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
- On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
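The EmrSteps route boils down to submitting a step definition that invokes spark-submit through EMR's command-runner.jar. A sketch of building such a step (the step name and script path are hypothetical); the resulting dict is what goes into the `steps` list of EmrAddStepsOperator, or directly into boto3's `add_job_flow_steps()`:

```python
def spark_submit_step(name, script, script_args=(), action_on_failure="CONTINUE"):
    """Return one EMR step definition that runs spark-submit via
    command-runner.jar on the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script, *script_args],
        },
    }

# Hypothetical step running a PySpark script stored on S3.
step = spark_submit_step("etl", "s3://my-bucket/jobs/etl.py",
                         ["--date", "2021-01-01"])
print(step)
```

In the DAG you would pair `EmrAddStepsOperator(steps=[step], ...)` with an `EmrStepSensor` watching the returned step id, since the step runs asynchronously.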
Use SSHHook / SSHOperator
- Again independent of remote system
- Comparatively easier to get started with
- If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
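To tame the "lots of arguments" problem, the command string for SSHOperator can be assembled programmatically. A sketch using only the standard library (the script path and options shown are hypothetical):

```python
import shlex

def build_spark_submit(app, app_args=(), conf=None, **options):
    """Assemble a shell-safe spark-submit command string, suitable
    for SSHOperator's `command` parameter."""
    parts = ["spark-submit"]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    for opt, value in options.items():  # e.g. deploy_mode -> --deploy-mode
        parts += ["--" + opt.replace("_", "-"), str(value)]
    parts.append(app)
    parts += list(app_args)
    return " ".join(shlex.quote(p) for p in parts)

cmd = build_spark_submit(
    "s3://my-bucket/jobs/etl.py",
    app_args=["--date", "2021-01-01"],
    conf={"spark.executor.memory": "2g"},
    master="yarn",
    deploy_mode="cluster",
)
print(cmd)
```

Quoting every token with `shlex.quote` keeps the command safe even when arguments contain spaces or shell metacharacters.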
EDIT-1
There seems to be another straightforward way
Specifying remote master-IP
- Independent of remote system
- Needs modifying Global Configurations / Environment Variables
- See @cricket_007's answer for details
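A sketch of that route: spark-submit runs locally on the Airflow host (e.g. via BashOperator) but targets the EMR cluster's YARN, which requires the cluster's core-site.xml/yarn-site.xml to be copied into a local directory pointed to by HADOOP_CONF_DIR. The paths below are hypothetical:

```python
import shlex

def remote_yarn_submit(script, hadoop_conf_dir):
    """Return (env, command) for a local spark-submit that targets the
    remote EMR YARN described by the *-site.xml files in hadoop_conf_dir."""
    env = {"HADOOP_CONF_DIR": hadoop_conf_dir}
    cmd = ["spark-submit", "--master", "yarn",
           "--deploy-mode", "cluster", script]
    return env, " ".join(shlex.quote(c) for c in cmd)

env, cmd = remote_yarn_submit("s3://my-bucket/jobs/etl.py", "/opt/emr-conf")
print(env, cmd)
```

With BashOperator you would pass `env=env` and `bash_command=cmd`; note that Spark and a matching Hadoop client must be installed on the Airflow machine.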
Useful links
- This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
- Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
- Remote spark-submit to YARN running on EMR