Which tool is recommended for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow
Thanks.
Recommended answer:
Updating my previous answer from here: Suggestion for scheduling tool(s) for building Hadoop-based data pipelines
- Airflow: Try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
- Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this.
- Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
- Check out the Azkaban CLI project for programmatic job creation. github/mtth/azkaban (examples github/joeharris76/azkaban_examples)
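Whichever scheduler you pick, supporting reruns and backfills mostly comes down to making the run date the job's only input. Here is a minimal sketch of that idea in plain Python, not tied to any scheduler's API. The script name `daily_job.py` and its `--date` argument are hypothetical, and `--master yarn` is only an assumption about the cluster.

```python
from datetime import date, timedelta
from typing import List


def run_dates(start: date, end: date) -> List[date]:
    """Return every daily run date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def spark_submit_command(run_date: date) -> List[str]:
    """Build the command for one run; the run date is the job's only input."""
    return [
        "spark-submit",
        "--master", "yarn",              # assumption: a YARN cluster
        "daily_job.py",                  # hypothetical application script
        "--date", run_date.isoformat(),  # hypothetical application argument
    ]


# Backfilling missed days is the same command with different dates.
for d in run_dates(date(2024, 1, 1), date(2024, 1, 3)):
    print(" ".join(spark_submit_command(d)))
```

Because the job never reads "today" from the clock, rerunning one bad day and backfilling a month use exactly the same code path.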
Philosophy:
Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn’t create them) and easier to debug/fix.
When complex actions are needed, you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.
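One common way to get both properties for a job's output is to write into a temporary directory and only swap it into place once everything has been written. The sketch below uses the local filesystem and only the standard library; the `date=...` partition layout, the `part-0000.txt` file name, and the `write_partition` helper are illustrative assumptions, and a real Spark job on HDFS or object storage would need that platform's equivalent of the final rename.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List


def write_partition(base_dir: Path, run_date: str, rows: List[str]) -> Path:
    """Write one day's output so that a failed run leaves no partial data."""
    final_dir = base_dir / f"date={run_date}"
    base_dir.mkdir(parents=True, exist_ok=True)
    # Stage the output next to its final location so the rename stays on one filesystem.
    tmp_dir = Path(tempfile.mkdtemp(dir=base_dir, prefix=".tmp-"))
    try:
        (tmp_dir / "part-0000.txt").write_text("\n".join(rows) + "\n")
        # Only touch the previous output after the new one is fully written.
        if final_dir.exists():
            shutil.rmtree(final_dir)
        os.replace(tmp_dir, final_dir)
    except BaseException:
        # On any failure, discard the staged output and re-raise.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_dir


with tempfile.TemporaryDirectory() as scratch:
    base = Path(scratch)
    write_partition(base, "2024-01-01", ["a", "b"])
    # Running the same day again replaces the partition instead of appending to it.
    write_partition(base, "2024-01-01", ["a", "b"])
    print(sorted(p.name for p in base.iterdir()))
```

If the write step fails, the previous partition is left untouched. There is still a brief window between removing the old directory and renaming the new one, so this is a sketch of the pattern rather than a strictly atomic swap.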