I'm trying to figure out the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster, and I'm thinking about creating a separate cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice? Deploying Spark remotely from another cluster looks a bit complicated, and it would mean duplicating some configuration files.
Answer:
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but having the driver managed by Airflow isn't a bad idea.)
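For concreteness, here is a minimal sketch of what the Airflow side could look like, assuming Airflow 2.4+ with the apache-airflow-providers-apache-spark package installed, the copied yarn-site.xml reachable via HADOOP_CONF_DIR on the Airflow worker, and a Spark connection named spark_yarn_client (an assumed name, not from the answer) configured with master yarn and deploy mode client; the application path is a placeholder:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Assumes HADOOP_CONF_DIR on the Airflow worker points at a directory that
# contains the yarn-site.xml copied from the Spark/Hadoop cluster.
with DAG(
    dag_id="spark_submit_yarn_client",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/jobs/my_job.py",  # placeholder path to the Spark application
        conn_id="spark_yarn_client",        # assumed connection: master=yarn, deploy-mode=client
        name="airflow_spark_job",
        verbose=True,
    )

With client deploy mode, the driver runs on the Airflow worker that executes the task, which is why that worker needs the YARN client configuration.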
Once the Application Master is deployed within YARN, Spark is running local to the Hadoop cluster.
If you really want, you could also ship an hdfs-site.xml and hive-site.xml from Airflow (if that's possible), but otherwise the hdfs-site.xml file should at least be picked up from the YARN container classpath (not every NodeManager necessarily has a Hive client installed on it).
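If you do want to ship those files with the job, one option (sketched below with placeholder paths and the same assumed connection as above) is the files argument of SparkSubmitOperator, which is passed through to spark-submit --files:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_with_extra_confs",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    SparkSubmitOperator(
        task_id="submit_with_extra_confs",
        application="/opt/jobs/my_job.py",  # placeholder path
        conn_id="spark_yarn_client",        # assumed connection from the sketch above
        # Forwarded as spark-submit --files: the listed files are shipped to the
        # working directory of the driver and executors. Paths are placeholders.
        files="/opt/conf/hdfs-site.xml,/opt/conf/hive-site.xml",
        name="spark_job_with_extra_confs",
    )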