Apache Spark: Batch Processing of Files

Problem description

I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB) that, once processed, will be more like 1 KB; after that I'd run sc.wholeTextFiles to get started with my analysis.

How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation on it (say, for the example's sake, keep only the first line), and then dump the result back to HDFS (as a new file, say .xmlr)?

Recommended answer

I'd recommend you just use sc.wholeTextFiles and preprocess the files with transformations, then save them all back as a single compressed sequence file (you can refer to my guide on doing so: 0x0fff/spark-hdfs-integration/).
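Here is a minimal sketch of that approach in Scala. The HDFS paths, the application name, and the "keep only the first line" step are all placeholders standing in for your real layout and preprocessing:

    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.{SparkConf, SparkContext}

    object PreprocessXml {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PreprocessXml"))

        // (path, content) pairs, one pair per file; the glob pulls in
        // *.xml files one subdirectory level down.
        val files = sc.wholeTextFiles("hdfs:///data/xml/*/*.xml")

        // Placeholder preprocessing: keep only the first line of each file.
        val processed = files.mapValues(_.split("\n", 2)(0))

        // Write one compressed sequence-file directory instead of
        // thousands of small files.
        processed.saveAsSequenceFile("hdfs:///data/xml-processed",
          Some(classOf[GzipCodec]))

        sc.stop()
      }
    }

Note that saveAsSequenceFile produces a directory of part files; if you want literally one file, a coalesce(1) before the save would do it, at the cost of write parallelism.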

Another option might be to write a MapReduce job that processes a whole file at a time and saves the output to a sequence file, as proposed above: github/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
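If you go that route, the sketch below is a condensed Scala rendering of the same idea, not the book's exact code. It assumes the custom WholeFileInputFormat from the linked repository is on the classpath (it delivers each file to the mapper as a single BytesWritable record); everything else is the standard Hadoop API:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{BytesWritable, NullWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
    import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

    // Key each output record by the path of the file it came from.
    class WholeFileMapper
        extends Mapper[NullWritable, BytesWritable, Text, BytesWritable] {
      override def map(key: NullWritable, value: BytesWritable,
          ctx: Mapper[NullWritable, BytesWritable, Text, BytesWritable]#Context): Unit = {
        val path = ctx.getInputSplit.asInstanceOf[FileSplit].getPath.toString
        ctx.write(new Text(path), value)
      }
    }

    object SmallFilesToSequenceFile {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "small files to sequence file")
        job.setJarByClass(classOf[WholeFileMapper])
        job.setInputFormatClass(classOf[WholeFileInputFormat]) // custom, from the book's repo
        job.setOutputFormatClass(classOf[SequenceFileOutputFormat[Text, BytesWritable]])
        job.setMapperClass(classOf[WholeFileMapper])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[BytesWritable])
        // Leaving the default single (identity) reducer in place coalesces
        // all records into one sequence file, as in the book's example.
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }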

In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to handle each of these files, so in general both approaches work on the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster with Spark.
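Either way, the downstream analysis would then read the sequence file back rather than the original small XML files. For the (String, String) file written by the Spark sketch above (path again hypothetical), that is a one-liner:

    // Spark handles the decompression transparently.
    val data = sc.sequenceFile[String, String]("hdfs:///data/xml-processed")
    data.take(5).foreach(println)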
