I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator.
My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When Airflow worker pods are killed while a streaming job is running, the following happens:
How can I force the worker to kill my Spark job before it shuts down?
I've tried killing the Celery worker with a TERM signal, but apparently that causes Celery to stop accepting new tasks and wait for current tasks to finish (docs).
Answer:
You need to be clearer about the issue. If you are saying that the Spark cluster finishes the job as expected and on_kill is not called, that is expected behavior. Per the docs, on_kill is for cleaning up after the task gets killed:
```python
def on_kill(self) -> None:
    """
    Override this method to cleanup subprocesses when a task instance
    gets killed. Any use of the threading, subprocess or multiprocessing
    module within an operator needs to be cleaned up or it will leave
    ghost processes behind.
    """
```
In your case, when you manually kill the job, it is doing what it has to do.
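To make the pattern concrete, here is a minimal sketch of how on_kill is typically used. The class below is a hypothetical stand-in, not the real Airflow BaseOperator or the DatabricksOperator: it tracks an external process started in execute() and terminates it when on_kill fires, which is the clean-up behavior the docstring above describes.

```python
import subprocess


class LongRunningTaskOperator:
    """Hypothetical stand-in for an Airflow operator with on_kill support."""

    def __init__(self):
        self.proc = None

    def execute(self, context=None):
        # Start a long-running child process (stands in for a Spark job).
        self.proc = subprocess.Popen(["sleep", "3600"])
        return self.proc.pid

    def on_kill(self):
        # Airflow calls on_kill when the task instance is killed; clean up
        # the child process here so it does not become a ghost process.
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait(timeout=10)


op = LongRunningTaskOperator()
op.execute()
op.on_kill()
```

In a real DatabricksOperator subclass, the clean-up step would instead cancel the Databricks run via the hook, but the shape is the same: execute() records a handle to the external work, and on_kill() uses it to tear the work down.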
Now, if you want clean-up to run even after successful completion of the job, override the post_execute function. As per the docs, post_execute is:
```python
def post_execute(self, context: Any, result: Any = None):
    """
    This hook is triggered right after self.execute() is called.
    It is passed the execution context and any results returned by the
    operator.
    """
```
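A short sketch of that hook in use, again with a hypothetical stand-in class rather than the real Airflow BaseOperator: post_execute receives the execution context and whatever execute() returned, so clean-up logic can act on the job's result.

```python
from typing import Any


class CleanupOperator:
    """Hypothetical stand-in for an operator with a post_execute hook."""

    def __init__(self):
        self.cleaned_after = None

    def execute(self, context=None):
        # Stands in for submitting the job and waiting for it to finish.
        return "job-result"

    def post_execute(self, context: Any, result: Any = None):
        # Runs right after a successful execute(); it gets the execution
        # context and execute()'s return value, so clean-up can use both.
        self.cleaned_after = result


op = CleanupOperator()
op.post_execute(context={}, result=op.execute())
```

Note that post_execute only fires after a successful execute(), so it complements on_kill rather than replacing it: on_kill covers the killed-task path, post_execute covers the success path.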