File Validation before loading into Hadoop MR

Updated: 2024-10-08 13:30:44

I have a folder of bz2 input files, some of which might be corrupted. I want to remove all corrupted or invalid bz2 files before running my MR job. What is a good way to do this?

Best answer

Use bzip2 -t to test whether a bzip2 file is corrupted. If it is, you should see output like this:

bzip2: test1.txt: bad magic number (file not created by bzip2)
bzip2: 2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
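For files on the local file system, a similar check can be sketched in Python with the standard-library bz2 module instead of the bzip2 command line. This is a substitute for bzip2 -t, not the same tool, and the function names `is_valid_bz2` and `find_corrupted` are made up for illustration. The sketch assumes the input files end in `.bz2` and sit directly in one folder.

```python
import bz2
from pathlib import Path


def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses cleanly, else False."""
    try:
        with bz2.open(path, "rb") as f:
            # Read to the end so that damage anywhere in the file is detected.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # OSError: invalid data (e.g. bad magic number); EOFError: truncated file.
        return False


def find_corrupted(folder):
    """Return the *.bz2 files directly under `folder` that fail the check, sorted."""
    return sorted(p for p in Path(folder).glob("*.bz2") if not is_valid_bz2(p))
```

Calling `find_corrupted("input_dir")` (a hypothetical folder name) returns the paths that failed, which can then be moved aside or deleted with `Path.unlink()` before the job is submitted.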

So if your files are on your local file system, a shell script built around this check should work. If your files are already on HDFS, use Hadoop Streaming with a script as the mapper that outputs the corrupted files, either with no reducer or with a reducer that deletes or post-processes those files.
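The HDFS variant could look roughly like the following mapper logic, written in Python. Several things here are assumptions that the answer does not specify: the mapper is fed a text file listing one HDFS path per line, file contents are read by piping `hdfs dfs -cat`, and the names `stream_is_valid_bz2`, `hdfs_chunks`, and `map_paths` are hypothetical. Unlike bzip2 -t, this sketch also treats an empty file or trailing non-bzip2 bytes as corruption.

```python
import bz2
import subprocess


def stream_is_valid_bz2(chunks):
    """Return True if the byte chunks form one or more complete bzip2 streams."""
    decomp = bz2.BZ2Decompressor()
    seen_data = False
    for chunk in chunks:
        while chunk:
            seen_data = True
            if decomp.eof:
                # Previous stream finished; start a new one (concatenated streams).
                decomp = bz2.BZ2Decompressor()
            try:
                decomp.decompress(chunk)
            except OSError:
                return False  # invalid data, e.g. bad magic number
            # Bytes left over after the end of a stream belong to the next stream.
            chunk = decomp.unused_data if decomp.eof else b""
    # Valid only if something was read and the last stream ended properly.
    return seen_data and decomp.eof


def hdfs_chunks(path, chunk_size=1 << 20):
    """Yield the raw bytes of an HDFS file by piping `hdfs dfs -cat` (assumed on PATH)."""
    proc = subprocess.Popen(["hdfs", "dfs", "-cat", path], stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()


def map_paths(lines, read_chunks=hdfs_chunks):
    """Yield each listed path whose content fails the bzip2 check."""
    for line in lines:
        path = line.strip()
        if path and not stream_is_valid_bz2(read_chunks(path)):
            yield path
```

In an actual streaming script, the mapper would print every path from `map_paths(sys.stdin)`, one per line. With no reducer, the job output is simply the list of corrupted files; a reducer or a follow-up step could then delete them, for example with `hdfs dfs -rm`.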

Published: 2023-08-02 08:43:00
Link: https://www.elefans.com/category/jswz/34/1372663.html