Running an EMR cluster from an Airflow DAG, terminating the EMR cluster once the job is done

Updated: 2024-10-12 14:20:23
Problem description

I have Airflow jobs which run fine on an EMR cluster. What I need is this: say I have 4 Airflow jobs that need an EMR cluster for, say, 20 minutes to complete their task. Why can't we create an EMR cluster at DAG run time and, once the jobs are finished, terminate the EMR cluster that was created?

Recommended answer

Absolutely, that would be the most efficient use of resources. But let me warn you: there are a lot of details in this; I'll try to list as many as will get you going. I encourage you to add your own comprehensive answer listing any problems you encountered and their workarounds (once you are through this).

Regarding cluster creation / termination

  • For cluster creation and termination, you have the EmrCreateJobFlowOperator and EmrTerminateJobFlowOperator
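A transient-cluster DAG built from these two operators might be wired up as below. This is a minimal sketch, not a drop-in file: the import paths are from the Amazon provider package (older 1.10.x installs used `airflow.contrib.*` paths instead), and the DAG id, connection id, and cluster config are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

with DAG(
    dag_id="transient_emr",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={...},    # your cluster-config JSON goes here
        aws_conn_id="aws_default",
    )

    # ... job-submission tasks go in between ...

    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        # the create operator pushes the job-flow id to XCom (Airflow 2 XComArg)
        job_flow_id=create_cluster.output,
        aws_conn_id="aws_default",
        trigger_rule="all_done",     # terminate even if a job task failed
    )

    create_cluster >> terminate_cluster
```

Note the `trigger_rule="all_done"` on the terminate task: without it, a failed job task would leave the cluster running (and billing) forever.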

Don't fret if you do not use an AWS SecretAccessKey (and rely wholly on IAM roles); instantiating any AWS-related hook or operator in Airflow will automatically fall back to the underlying EC2 instance's attached IAM role

If you're NOT using the EMR-Steps API for job submission, then you'll also have to manually sense both of the above operations using Sensors. There's already a sensor for polling the creation phase, called EmrJobFlowSensor, and you can modify it slightly to create a sensor for termination too
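The core of such a sensor is just polling the cluster state against a set of target and failed states; the creation and termination variants differ only in which states count as success. A sketch of that logic as a plain function (the state names are the ones the EMR API reports; an actual sensor class would wrap this in Airflow's `poke()` protocol):

```python
# EMR cluster states, per the EMR API:
# STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING,
# TERMINATED, TERMINATED_WITH_ERRORS

# Creation sensor: succeed once the cluster is usable, fail if it died.
TARGET_STATES_CREATE = {"RUNNING", "WAITING"}
FAILED_STATES_CREATE = {"TERMINATED", "TERMINATED_WITH_ERRORS"}

# Termination sensor: succeed once the cluster is gone cleanly.
TARGET_STATES_TERMINATE = {"TERMINATED"}
FAILED_STATES_TERMINATE = {"TERMINATED_WITH_ERRORS"}

def poke(state: str, target_states: set, failed_states: set) -> bool:
    """Return True when the sensor should succeed, False to keep waiting;
    raise when the cluster entered a failed state."""
    if state in failed_states:
        raise RuntimeError(f"EMR job flow failed: state={state}")
    return state in target_states
```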

You pass your cluster-config JSON in job_flow_overrides. You can also pass configs in a Connection's (like my_emr_conn) extra param, but refrain from doing so, because it often breaks SQLAlchemy ORM loading (since it is a big JSON)
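For reference, such a cluster-config JSON typically looks like the following. All specifics (cluster name, EMR release, instance types, key-pair name, role names) are illustrative placeholders, not recommendations:

```python
# A hand-written sample cluster config in the shape the EMR RunJobFlow
# API expects; every concrete value below is a placeholder.
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-transient-emr",
    "ReleaseLabel": "emr-5.29.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        # keep the cluster alive between steps so Airflow can submit jobs;
        # the terminate task shuts it down explicitly at the end
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
        "Ec2KeyName": "my-keypair",  # needed later for the SSH route
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```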

Regarding job submission

  • You either submit jobs to EMR using the EMR-Steps API, which can be done either during the cluster-creation phase (within the cluster-config JSON) or afterwards using add_job_flow_steps(). There's even an emr_add_steps_operator() in Airflow, which also requires an EmrStepSensor. You can read more about it in the AWS docs, and you might also have to use command-runner.jar
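An EMR step that goes through command-runner.jar might look like the one below; the step name and S3 path are hypothetical. Such a list would be handed to the add-steps operator (EmrAddStepsOperator in current provider packages), with an EmrStepSensor polling the returned step id:

```python
# A hand-written sample step list in the shape the EMR AddJobFlowSteps
# API expects; the job name and S3 path are placeholders.
SPARK_STEPS = [
    {
        "Name": "run_my_spark_job",
        # tear the cluster down if the job blows up, so it never idles
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar lets a step run an arbitrary command
            # (here: spark-submit) on the master node
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/job.py",
            ],
        },
    },
]
```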

For application-specific cases (like Hive, Livy), you can use their specific ways. For instance, you can use HiveServer2Hook to submit a Hive job. Here's a tricky part: the run_job_flow() call (made during the cluster-creation phase) only gives you a job_flow_id (cluster id). You'll have to use a describe_cluster() call via EmrHook to obtain the private IP of the master node. Using this, you will then be able to programmatically create a Connection (such as a Hive Server 2 Thrift connection) and use it for submitting your computations to the cluster. And don't forget to delete those connections (for elegance) before completing your workflow.
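One detail worth noting: in the boto3 EMR API, describe_cluster() exposes the master's public DNS name, while the private IPs of individual instances come back from list_instances(). A sketch of pulling the master's private IP out of such a response (the payload below is a hand-written sample, not real API output):

```python
def master_private_ip(list_instances_response: dict) -> str:
    """Pick the private IP of the first master instance out of an EMR
    list_instances(ClusterId=..., InstanceGroupTypes=['MASTER']) response."""
    instances = list_instances_response["Instances"]
    if not instances:
        raise ValueError("no master instances in response")
    return instances[0]["PrivateIpAddress"]

# hand-written sample in the shape list_instances() returns
sample_response = {
    "Instances": [
        {"Ec2InstanceId": "i-0abc123", "PrivateIpAddress": "10.0.1.23"},
    ],
}
```

With the IP in hand, the workflow can create the Thrift connection, point the hook at it, and delete it again in a cleanup task.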

Finally, there's good old bash for interacting with the cluster. For this you should also pass an EC2 key pair during the cluster-creation phase. Afterwards, you can programmatically create an SSH connection and use it (with an SSHHook or SSHOperator) for running jobs on your cluster. Read more about SSH-related stuff in Airflow here
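With the key pair in place, the SSH route can be sketched as a single task. This is a config fragment, not a runnable pipeline: the connection id and the spark-submit command are hypothetical, and in practice the connection would have been created programmatically from the master's IP as described above.

```python
# Import path is from the SSH provider package; older Airflow versions
# used airflow.contrib.operators.ssh_operator instead.
from airflow.providers.ssh.operators.ssh import SSHOperator

submit_via_ssh = SSHOperator(
    task_id="submit_via_ssh",
    ssh_conn_id="emr_master_ssh",  # hypothetical, created at runtime
    command="spark-submit --deploy-mode cluster s3://my-bucket/jobs/job.py",
)
```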

Particularly for submitting Spark jobs to a remote EMR cluster, read this discussion


Published: 2023-11-23 16:31:19
Original link: https://www.elefans.com/category/jswz/34/1622132.html