Exporting environment variables at run time in Airflow

Problem description

I am currently converting workflows that were previously implemented as bash scripts into Airflow DAGs. In the bash scripts, I was just exporting the variables at run time with:

export HADOOP_CONF_DIR="/etc/hadoop/conf"

Now I'd like to do the same in Airflow, but haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME]='some_text' outside of any method or operator, but that means they get exported the moment the script gets loaded, not at run time.
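For illustration, that workaround is just a module-level assignment (a minimal sketch using the value from above):

import os

# Module-level code runs whenever the scheduler parses/loads the DAG file,
# so this export happens at load time, not when a task executes.
os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"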

Now when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a PythonOperator, it does not work. My code looks like this:

import os

from airflow.operators.python_operator import PythonOperator

def set_env():
    os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"
    os.environ['PATH'] = "somePath:" + os.environ['PATH']
    os.environ['SPARK_HOME'] = "pathToSparkHome"
    os.environ['PYTHONPATH'] = "somePythonPath"
    os.environ['PYSPARK_PYTHON'] = os.popen('which python').read().strip()
    os.environ['PYSPARK_DRIVER_PYTHON'] = os.popen('which python').read().strip()

set_env_operator = PythonOperator(
    task_id='set_env_vars_NOT_WORKING',
    python_callable=set_env,
    dag=dag)

Now when my SparkSubmitOperator gets executed, I get the exception:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

My use case where this is relevant is that I have a SparkSubmitOperator that submits jobs to YARN, so either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. Setting them in my .bashrc or any other config is sadly not possible for me, which is why I need to set them at runtime.

Preferably I'd like to set them in an Operator before executing the SparkSubmitOperator, but if there were a way to pass them as arguments to the SparkSubmitOperator, that would at least be something.

Recommended answer

From what I can see in the SparkSubmitOperator, you can pass environment variables to spark-submit as a dictionary:

:param env_vars: Environment variables for spark-submit. It supports yarn and k8s mode too.
:type env_vars: dict

Have you tried that?
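A minimal sketch of that approach (the import path depends on the Airflow version; older releases ship the operator as airflow.contrib.operators.spark_submit_operator, and the application path and connection ID below are placeholders):

# Sketch under assumptions: import path, application path, and conn_id are illustrative.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/path/to/your_job.py',  # placeholder application
    conn_id='spark_default',             # placeholder Spark connection
    env_vars={
        # Forwarded to the spark-submit process environment at run time,
        # so HADOOP_CONF_DIR does not have to live in .bashrc.
        'HADOOP_CONF_DIR': '/etc/hadoop/conf',
    },
    dag=dag,
)

This also explains why the separate set_env task cannot work: each Airflow task runs in its own process, possibly on a different worker, so os.environ changes made in one task do not carry over to the process that runs the SparkSubmitOperator. Passing env_vars keeps the export inside the task that actually needs it.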
