PySpark logging from executors

Problem description

What is the correct way to access Spark's log4j logger on an executor when using PySpark?

It's easy to do in the driver, but I cannot figure out how to access the logging functionality on the executors so that I can log locally and let YARN collect the local logs.

Is there any way to access the local logger?

The standard logging procedure is not enough, because I cannot access the Spark context from the executor.
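For context, here is a minimal sketch of the driver-side approach alluded to above, using Spark's log4j through the py4j gateway (the _jvm accessor is an internal API, and the logger name "my_app" is made up for illustration). The same object cannot be reused inside executor code, because the SparkContext and its JVM gateway cannot be serialized and shipped to the workers:

# Driver-side logging through Spark's log4j (works only on the driver).
# Assumes an existing SparkSession named `spark`.
log4j = spark.sparkContext._jvm.org.apache.log4j
driver_logger = log4j.LogManager.getLogger("my_app")
driver_logger.info("This line ends up in the driver's log4j output")

# Using driver_logger inside a map function would fail with a
# serialization error, since py4j objects cannot be pickled:
# spark.range(10).rdd.map(lambda x: driver_logger.info(x)).count()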

Recommended answer

You cannot use the local log4j logger on executors. The Python workers spawned by the executor JVMs have no "callback" connection to the Java side; they just receive commands. There is, however, a way to log from executors using standard Python logging and have YARN capture those logs.

Place a Python module file on your HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):

import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled')
            return

        file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
        logging.basicConfig(filename=file, level=logging.INFO,
                            format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

    def __getattr__(self, key):
        return getattr(logging, key)

YarnLogger.setup_logger()
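(The __getattr__ method simply forwards attribute lookups to the standard logging module, so a YarnLogger instance exposes the usual info, warning, error, etc. calls, which write to the per-container pyspark.log file configured above.)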

Then import this module inside your application:

spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()

You can then use it inside your PySpark functions like a normal logging library:

def map_sth(s):
    logger.info("Mapping " + str(s))
    return s

spark.range(10).rdd.map(map_sth).count()

The pyspark.log file will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId ....
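If you need the application ID for that command, it is available from the driver; a small sketch (again assuming a SparkSession named spark):

# Print the YARN application ID so the aggregated logs can be fetched
# later with `yarn logs -applicationId <id>` after the application finishes.
app_id = spark.sparkContext.applicationId
print("yarn logs -applicationId " + app_id)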
