How do I submit a Spark job to an EMR cluster from Airflow?

Updated: 2024-10-12 14:21:16
This article covers how to submit a Spark job to an EMR cluster from Airflow.

Problem description

How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same SG, VPC, and subnet.

I need solutions so that Airflow can talk to EMR and execute a Spark submit.

I found this blog: aws.amazon/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/ — it covers execution after the connection has been established, but it didn't help much.

In Airflow I have made a connection using the UI for AWS and EMR.

Below is code that will list the EMR clusters which are Active and Terminated; I can also fine-tune it to get only active clusters:

```python
from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# list_clusters() returns clusters in all states; pass e.g.
# ClusterStates=['RUNNING', 'WAITING'] to restrict to active ones
clusters = client.list_clusters()['Clusters']
for x in clusters:
    print(x['Status']['State'], x['Name'])
```

My question is: how can I update the code above to perform spark-submit actions?

Solution

While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on a (remote) EMR cluster via Airflow:

  • Use Apache Livy

    • This solution is actually independent of the remote server, i.e., EMR
    • Here's an example
    • The downside is that Livy is in its early stages and its API appears incomplete and wonky to me
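As a minimal sketch of the Livy approach: the spark-submit is expressed as a JSON payload POSTed to Livy's /batches endpoint on the EMR master (port 8998 by default). The host, S3 path, class name, and arguments below are hypothetical placeholders, not values from the question.

```python
import json

def build_livy_batch(file, class_name=None, args=None, conf=None):
    """Build the JSON payload for a POST to Livy's /batches endpoint."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload

payload = build_livy_batch(
    file="s3://my-bucket/jars/my-spark-job.jar",   # placeholder artifact
    class_name="com.example.MySparkJob",           # placeholder class
    args=["--date", "2021-01-01"],
)

# To actually submit (requires network access from Airflow to the EMR master):
#   import requests
#   resp = requests.post("http://<emr-master-ip>:8998/batches",
#                        data=json.dumps(payload),
#                        headers={"Content-Type": "application/json"})
#   batch_id = resp.json()["id"]   # then poll GET /batches/{id}/state
print(json.dumps(payload, indent=2))
```

Because the job runs asynchronously, a real DAG would poll the batch state until it reaches a terminal status.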
  • Use the EmrSteps API

    • Dependent on the remote system: EMR
    • Robust, but since it is inherently asynchronous, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
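The EmrSteps approach wraps spark-submit in an EMR step definition. Below is a sketch: the step dict is plain EMR/boto3 JSON, and the cluster id, jar path, and class name are placeholders. The commented operator/sensor pairing uses the old airflow.contrib import paths, matching the question's code.

```python
# An EMR step that runs spark-submit via EMR's built-in command-runner.jar.
SPARK_STEP = {
    "Name": "my-spark-submit-step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command runner
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "--class", "com.example.MySparkJob",       # placeholder class
            "s3://my-bucket/jars/my-spark-job.jar",    # placeholder artifact
        ],
    },
}

# Inside a DAG, pair the add-steps operator with a step sensor, since the
# step runs asynchronously:
#
#   from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
#   from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
#
#   add_step = EmrAddStepsOperator(
#       task_id="add_step",
#       job_flow_id="j-XXXXXXXXXXXXX",   # your EMR cluster id
#       aws_conn_id="aws_default",
#       steps=[SPARK_STEP],
#   )
#   watch_step = EmrStepSensor(
#       task_id="watch_step",
#       job_flow_id="j-XXXXXXXXXXXXX",
#       step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
#       aws_conn_id="aws_default",
#   )
#   add_step >> watch_step
print(SPARK_STEP["Name"])
```

The same step dict can also be submitted directly from the boto3 EMR client in the question's code, via `client.add_job_flow_steps(JobFlowId=..., Steps=[SPARK_STEP])`.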
  • Use SSHHook / SSHOperator

    • Again independent of the remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
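For the SSH approach, the cumbersome part is assembling the spark-submit command string; a small helper like the following sketch (with placeholder paths and class name) keeps that manageable, and the resulting string can be handed to an SSHOperator.

```python
import shlex

def build_spark_submit(app_path, main_class=None, conf=None, app_args=None):
    """Assemble a spark-submit command line from structured pieces."""
    parts = ["spark-submit"]
    if main_class:
        parts += ["--class", main_class]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(app_path)
    parts += list(app_args or [])
    # Quote each token so paths/arguments survive the remote shell.
    return " ".join(shlex.quote(p) for p in parts)

command = build_spark_submit(
    app_path="s3://my-bucket/jars/my-spark-job.jar",  # placeholder artifact
    main_class="com.example.MySparkJob",              # placeholder class
    conf={"spark.executor.memory": "4g"},
    app_args=["--date", "2021-01-01"],
)

# With Airflow's SSH operator (contrib path, as elsewhere in this answer):
#   from airflow.contrib.operators.ssh_operator import SSHOperator
#   run_job = SSHOperator(
#       task_id="spark_submit_over_ssh",
#       ssh_conn_id="emr_master_ssh",   # SSH connection to the EMR master node
#       command=command,
#   )
print(command)
```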
  • EDIT-1

    There seems to be another straightforward way

  • Specifying the remote master IP

    • Independent of the remote system
    • Requires modifying global configurations / environment variables
    • See @cricket_007's answer for details
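A rough sketch of this last approach, assuming the EMR master's Hadoop/YARN client configuration files have been copied to the Airflow machine so that a locally installed spark-submit can resolve the remote YARN master (paths and class name below are placeholders):

```python
import os

# Point a locally installed spark-submit at the remote cluster via the
# Hadoop/YARN client configuration copied from the EMR master node.
env = dict(os.environ)
env["HADOOP_CONF_DIR"] = "/opt/emr-conf"  # placeholder path to copied config
env["YARN_CONF_DIR"] = "/opt/emr-conf"

cmd = [
    "spark-submit",
    "--master", "yarn",            # "yarn" is resolved through the copied config
    "--deploy-mode", "cluster",
    "--class", "com.example.MySparkJob",      # placeholder class
    "s3://my-bucket/jars/my-spark-job.jar",   # placeholder artifact
]

# To actually run it (requires spark-submit installed on the Airflow host):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
print(" ".join(cmd))
```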
  • Useful links

    • This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
    • Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
    • Remote spark-submit to YARN running on EMR


Published: 2023-11-23 16:29:53
Source: https://www.elefans.com/category/jswz/34/1622126.html