Passing a Typesafe config .conf file to DataProcSparkOperator

Problem description

I am using Google dataproc to submit spark jobs and google cloud composer to schedule them. Unfortunately, I am facing difficulties.

I am relying on .conf files (typesafe config files) to pass arguments to my spark jobs.

I am using the following Python code for the Airflow Dataproc task:

```python
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='coman.ingestion.Main',
    project_id='project',
    dataproc_spark_properties={'spark.driver.extraJavaOptions': 'gs://file-dev/fileConf/development.conf'},
    scopes='https://www.googleapis.com/auth/cloud-platform',
    dag=dag)
```

But this is not working and I am getting some errors.

Could anyone help me with this? Basically I want to be able to override the .conf files and pass them as arguments to my DataProcSparkOperator. I also tried to do

arguments='gs://file-dev/fileConf/development.conf'

but this didn't take into account the .conf file mentioned in the arguments.

Recommended answer

tl;dr You need to turn your development.conf file into a dictionary to pass to dataproc_spark_properties.
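As a rough illustration of that conversion, here is a minimal sketch that flattens a .conf file into the dictionary shape `dataproc_spark_properties` expects. It assumes the file holds flat `key = value` lines; real HOCON with nested blocks, includes, or substitutions would need a proper parser such as pyhocon.

```python
def conf_to_properties(text):
    """Flatten simple 'key = value' lines into a dict (illustrative only;
    does not handle nested HOCON blocks, includes, or substitutions)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip().strip('"')
    return props

# Hypothetical development.conf contents:
example = """
# development.conf
spark.executor.memory = "4g"
ingestion.topic = "snapshots-dev"
"""
print(conf_to_properties(example))
```

The resulting dict can be passed directly as `dataproc_spark_properties={...}`.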

Full explanation:

There are two main ways to set properties -- on the cluster level and on the job level.

1) Job level

Looks like you are trying to set them on the job level: DataProcSparkOperator(dataproc_spark_properties={'foo': 'bar', 'foo2': 'bar2'}). That's the same as gcloud dataproc jobs submit spark --properties foo=bar,foo2=bar2 or spark-submit --conf foo=bar --conf foo2=bar2. See the Dataproc documentation for per-job properties.

The argument to spark.driver.extraJavaOptions should be command line arguments you would pass to java. For example, -verbose:gc.
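Putting those two corrections together, a hedged sketch of what the operator call from the question might look like. The `files` parameter and the `-Dconfig.file` flag are assumptions here: Typesafe Config reads `-Dconfig.file=<path>` from the JVM options, and `files=` stages the .conf alongside the job; whether the driver sees the staged file in its working directory depends on the deploy mode, so treat this as a sketch, not a verified fix.

```python
# Sketch only (untested): stage development.conf with the job and point
# Typesafe Config at it via a real JVM flag, instead of passing a
# gs:// URL as the value of extraJavaOptions.
t3 = dataproc_operator.DataProcSparkOperator(
    task_id='execute_spark_job_cluster_test',
    dataproc_spark_jars='gs://snapshots/jars/pubsub-assembly-0.1.14-SNAPSHOT.jar',
    cluster_name='cluster',
    main_class='coman.ingestion.Main',
    project_id='project',
    # Ship the .conf file alongside the job...
    files=['gs://file-dev/fileConf/development.conf'],
    # ...and tell Typesafe Config to load it (a JVM flag, not a URL).
    dataproc_spark_properties={
        'spark.driver.extraJavaOptions': '-Dconfig.file=development.conf'},
    dag=dag)
```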

2) Cluster level

You can also set properties on a cluster level using DataprocClusterCreateOperator(properties={'spark:foo': 'bar', 'spark:foo2': 'bar2'}), which is the same as gcloud dataproc clusters create --properties spark:foo=bar,spark:foo2=bar2 (see the Dataproc cluster-properties documentation). Again, you need to use a dictionary.

Importantly, if you specify properties at the cluster level, you need to prefix them with which config file you want to add the property to. If you use spark:foo=bar, that means add foo=bar to /etc/spark/conf/spark-defaults.conf. There are similar prefixes for yarn-site.xml, etc.
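To make the prefix rule concrete, a small illustrative sketch. The file paths follow the Dataproc cluster-properties convention; the mapping below is deliberately partial (Dataproc supports more prefixes, e.g. hdfs:, mapred:, hive:).

```python
# Illustrative only: map a Dataproc-prefixed cluster property to the
# config file it lands in. Partial mapping, for demonstration.
PREFIX_TO_FILE = {
    "spark": "/etc/spark/conf/spark-defaults.conf",
    "yarn": "/etc/hadoop/conf/yarn-site.xml",
    "core": "/etc/hadoop/conf/core-site.xml",
}

def route_property(prop):
    """Split 'prefix:key=value' and return (target file, 'key=value')."""
    prefix, _, key_value = prop.partition(":")
    return PREFIX_TO_FILE[prefix], key_value

print(route_property("spark:foo=bar"))
```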

3) Using your .conf file at the cluster level

If you don't want to turn your .conf file into a dictionary, you can also just append it to /etc/spark/conf/spark-defaults.conf using an initialization action (see the GoogleCloudPlatform/dataproc-initialization-actions repository on GitHub) when you create the cluster.

For example (untested):

```shell
#!/bin/bash
set -euxo pipefail

gsutil cp gs://path/to/my.conf .
cat my.conf >> /etc/spark/conf/spark-defaults.conf
```

Note that you want to append to rather than replace the existing config file, just so that you only override the configs you need to.
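For completeness, a hedged sketch of wiring that script into cluster creation from Airflow. The gs:// path is hypothetical (upload the script to your own bucket first), and `init_actions_uris` is the parameter name in Airflow's contrib Dataproc operator.

```python
# Sketch only: create the cluster with the append-to-spark-defaults
# script as an initialization action. Bucket path is hypothetical.
create_cluster = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name='cluster',
    project_id='project',
    num_workers=2,
    init_actions_uris=['gs://my-bucket/append-spark-defaults.sh'],
    dag=dag)
```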

Published 2023-11-25 16:57:40 at https://www.elefans.com/category/jswz/34/1630586.html