What is the correct way to access the log4j logger of Spark using pyspark on an executor?
It's easy to do so in the driver but I cannot seem to understand how to access the logging functionalities on the executor so that I can log locally and let YARN collect the local logs.
Is there any way to access the local logger?
The standard logging procedure is not enough because I cannot access the spark context from the executor.
Answer:
You cannot use a local log4j logger on executors. The Python workers spawned by executor JVMs have no "callback" connection to the Java side; they only receive commands. But there is a way to log from executors using standard Python logging and have YARN capture the output.
On your HDFS, place a Python module file that configures logging once per Python worker and proxies the logging functions (name it logger.py):
```python
import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
            return

        file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
        logging.basicConfig(
            filename=file,
            level=logging.INFO,
            format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

    def __getattr__(self, key):
        return getattr(logging, key)

YarnLogger.setup_logger()
```
Then import this module inside your application:
```python
spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()
```
And you can use it inside your pyspark functions like a normal logging library:
```python
def map_sth(s):
    logger.info("Mapping " + str(s))
    return s

spark.range(10).rdd.map(map_sth).count()
```
The pyspark.log file will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....