I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB each) that once processed will be more like 1 KB; I'd then run sc.wholeTextFiles to get started with my analysis.
How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation (for the example's sake, keep only the first line), and then dump the result back to HDFS as a new file (say, .xmlr)?
Accepted answer:
I'd recommend you just use sc.wholeTextFiles and preprocess the files using transformations, then save all of them back as a single compressed sequence file (you can refer to my guide on doing so: 0x0fff/spark-hdfs-integration/).
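A minimal PySpark sketch of that approach. The paths `hdfs:///data/*/*.xml` and `hdfs:///data-processed` are placeholders, and `keep_first_line` is a hypothetical helper standing in for whatever preprocessing you actually need:

```python
def keep_first_line(content: str) -> str:
    """The example operation from the question: keep only the first line."""
    lines = content.splitlines()
    return lines[0] if lines else ""

def preprocess(sc, src="hdfs:///data/*/*.xml", dst="hdfs:///data-processed"):
    # wholeTextFiles yields (path, full_content) pairs, one pair per file,
    # so each file's content is preprocessed as a single string.
    files = sc.wholeTextFiles(src)
    # Apply the preprocessing as an ordinary transformation, then save
    # everything back as one compressed sequence file: keys are the
    # original paths, values are the small processed bodies.
    (files.mapValues(keep_first_line)
          .saveAsSequenceFile(dst, "org.apache.hadoop.io.compress.GzipCodec"))

if __name__ == "__main__":
    from pyspark import SparkContext  # requires a Spark installation
    preprocess(SparkContext(appName="preprocess-xml"))
```

Because the result is one sequence file rather than thousands of small files, the follow-up analysis avoids HDFS's small-files overhead.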
Another option might be to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as I proposed above: github/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to handle each file, so in general the two approaches work on the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster with Spark.