What is the correct way to access the log4j logger of Spark using pyspark on an executor?
It's easy to do so in the driver but I cannot seem to understand how to access the logging functionalities on the executor so that I can log locally and let YARN collect the local logs.
Is there any way to access the local logger?
The standard logging procedure is not enough because I cannot access the spark context from the executor.
Answer:
You cannot use a local log4j logger on executors. The Python workers spawned by executor JVMs have no "callback" connection to the Java side; they only receive commands. But there is a way to log from executors using standard Python logging and have YARN capture the output.
On your HDFS, place a Python module file that configures logging once per Python worker and proxies the logging functions (name it logger.py):
```python
import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
            return

        file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
        logging.basicConfig(
            filename=file,
            level=logging.INFO,
            format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

    def __getattr__(self, key):
        return getattr(logging, key)

YarnLogger.setup_logger()
```
Then import this module inside your application:
```python
spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()
```
And you can use it inside your pyspark functions like a normal logging library:
```python
def map_sth(s):
    logger.info("Mapping " + str(s))
    return s

spark.range(10).rdd.map(map_sth).count()
```
The pyspark.log file will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....