I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB each) that once processed will be more like 1 KB; I'd then run sc.wholeTextFiles to get started with my analysis.
How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation (for the example's sake, keep only the first line), and then dump the result back to HDFS as a new file (say, .xmlr)?
Accepted answer:
I'd recommend you just use sc.wholeTextFiles and preprocess the files using transformations, then save all of them back as a single compressed sequence file (you can refer to my guide on doing so: 0x0fff/spark-hdfs-integration/).
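A minimal PySpark sketch of that approach. The paths `hdfs:///data/*/*.xml` and `hdfs:///data-processed` are placeholders, and `keep_first_line` is a hypothetical helper standing in for whatever preprocessing you actually need:

```python
def keep_first_line(content: str) -> str:
    """The example operation from the question: keep only the first line."""
    lines = content.splitlines()
    return lines[0] if lines else ""

def preprocess(sc, src="hdfs:///data/*/*.xml", dst="hdfs:///data-processed"):
    # wholeTextFiles yields (path, full_content) pairs, one pair per file,
    # so each file's content is preprocessed as a single string.
    files = sc.wholeTextFiles(src)
    # Apply the preprocessing as an ordinary transformation, then save
    # everything back as one compressed sequence file: keys are the
    # original paths, values are the small processed bodies.
    (files.mapValues(keep_first_line)
          .saveAsSequenceFile(dst, "org.apache.hadoop.io.compress.GzipCodec"))

if __name__ == "__main__":
    from pyspark import SparkContext  # requires a Spark installation
    preprocess(SparkContext(appName="preprocess-xml"))
```

Because the result is one sequence file rather than thousands of small files, the follow-up analysis avoids HDFS's small-files overhead.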
Another option might be to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as I proposed above: github/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to handle each file, so in general the two approaches work on the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster with Spark.