Apache Airflow or Apache Beam for data processing and job scheduling

Updated: 2024-10-13 20:20:56

Question

I'm trying to give useful information but I am far from being a data engineer.

I am currently using the Python library pandas to run a long series of transformations on my data, which comes from many inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to run this once a month as a scheduled, monitored batch job with parallel computation (that is, less sequential than what I do now with pandas).

I don't really know Beam or Airflow. I skimmed the docs, and it seems that both can achieve this. Which one should I use?

Recommended answer

Apache Airflow is not a data processing engine.

Airflow is a platform to programmatically author, schedule, and monitor workflows.
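To make "programmatically author a workflow" concrete, here is a minimal plain-Python sketch. It is not Airflow code: it only shows the idea of declaring tasks plus their dependencies and letting a runner execute them in dependency order. The task names (`load_csv`, `load_excel`, `transform`, `write_reports`) and the `run_workflow` helper are hypothetical stand-ins for the pandas steps described in the question. Airflow expresses the same idea as a DAG of tasks and adds scheduling and monitoring on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task once, after all the tasks it depends on.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    Returns the order in which the tasks were executed.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Hypothetical tasks; real ones would read files and run pandas code.
log = []
tasks = {
    "load_csv": lambda: log.append("load_csv"),
    "load_excel": lambda: log.append("load_excel"),
    "transform": lambda: log.append("transform"),
    "write_reports": lambda: log.append("write_reports"),
}

# The two load tasks do not depend on each other, so a scheduler is free
# to run them in parallel; this toy runner simply runs them one by one.
dependencies = {
    "load_csv": set(),
    "load_excel": set(),
    "transform": {"load_csv", "load_excel"},
    "write_reports": {"transform"},
}

order = run_workflow(tasks, dependencies)
print(order)
```

Declaring dependencies explicitly is what lets an orchestrator see which steps are independent, which is the kind of parallelism the question asks about.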

Cloud Dataflow is a fully managed service on Google Cloud that can be used for data processing (it runs pipelines written with the Apache Beam SDK). You can write your Dataflow code and then use Airflow to schedule and monitor the Dataflow job. Airflow can also retry the job if it fails (the number of retries is configurable), and it can be configured to send alerts via Slack or email when the Dataflow pipeline fails.
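The retry-then-alert behaviour described above can be illustrated with a small plain-Python sketch. This is an illustration of the concept under stated assumptions, not Airflow's implementation or API: `run_with_retries`, `flaky_job`, and the `alerts` list are hypothetical names, and the alert callback here just records the error where a real one would post to Slack or send an email.

```python
import time


def run_with_retries(job, retries=2, retry_delay=0.0, on_failure=None):
    """Call job(); if it raises, retry up to `retries` more times.

    If every attempt fails, call on_failure(exc) (for example to post to
    Slack or send an email) and re-raise the last exception.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                if on_failure is not None:
                    on_failure(exc)
                raise
            time.sleep(retry_delay)


# Hypothetical job that fails twice before succeeding.
calls = {"count": 0}


def flaky_job():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


alerts = []  # stand-in for a Slack or email notification channel
result = run_with_retries(flaky_job, retries=2, on_failure=alerts.append)
print(result, calls["count"], alerts)
```

In Airflow itself you would not hand-write this loop. Retries and failure notifications are configured on the task, for example through operator arguments such as `retries`, `retry_delay`, and `on_failure_callback`; the exact options available depend on the Airflow version.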

Published: 2023-11-24 02:34:13
Article link: https://www.elefans.com/category/jswz/34/1623651.html