Spark 1.6 DirectFileOutputCommitter

Updated: 2024-10-25 00:27:48

Problem description

I am having a problem saving text files to S3 using pyspark. I am able to save to S3, but the job first uploads to a _temporary directory on S3 and then copies the files to the intended location. This increases the job's run time significantly. I have attempted to compile a DirectFileOutputCommitter, which should write directly to the intended S3 URL, but I cannot get Spark to use this class.

Example:

someRDD.saveAsTextFile("s3a://somebucket/savefolder")

This creates a
s3a://somebucket/savefolder/_temporary/

directory, which the job writes to first; an S3 copy operation then moves the files back to

s3a://somebucket/savefolder

My question is: does anyone have a working jar of the DirectFileOutputCommitter, or experience working around this issue?
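
For reference, here is a minimal sketch of how a custom committer is usually wired in for saveAsTextFile, in case the class is compiled but simply not being picked up. It relies on two things I believe hold for Spark 1.6: saveAsTextFile goes through the old Hadoop mapred API, which reads its committer from the mapred.output.committer.class key, and Spark copies any spark.hadoop.* setting into the Hadoop configuration. The class name com.example.DirectFileOutputCommitter and the helper names committer_settings and make_context are placeholders of mine, not anything from a real library, and the compiled jar would still have to be on the driver and executor classpath (for example via --jars).

```python
def committer_settings(committer_class):
    """Build Spark settings that point the old mapred API at a custom committer."""
    return {
        # Spark copies "spark.hadoop.*" keys into the Hadoop Configuration,
        # and the old mapred API reads the committer class from this Hadoop key.
        "spark.hadoop.mapred.output.committer.class": committer_class,
        # Speculative task attempts could clobber each other when output is
        # written in place, so keep speculation off (this is Spark's default).
        "spark.speculation": "false",
    }


def make_context(settings, app_name="direct-committer-demo"):
    """Create a SparkContext with the settings applied (needs pyspark installed)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = SparkConf().setAppName(app_name).setAll(list(settings.items()))
    return SparkContext(conf=conf)


# Hypothetical class name: replace it with the class you actually compiled.
settings = committer_settings("com.example.DirectFileOutputCommitter")
```

With pyspark available, sc = make_context(settings) followed by the saveAsTextFile call above should show whether the _temporary directory still appears. If it does, the committer class is probably not being loaded, or the write is going through a different API than assumed here.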