I have a folder of input bz2 files, some of which might be corrupted. I want to remove all the corrupted/invalid bz2 files before running my MR job. What's a good way to do this?
Best answer
Use bzip2 -t to test whether a bzip2 file is corrupted. If it is, I think you will see output something like this:

bzip2: test1.txt: bad magic number (file not created by bzip2)
bzip2: 2: bad magic number (file not created by bzip2)
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.

So if your files are on your local file system, a shell script built around this check should work. If your files are already on HDFS, use Hadoop Streaming with a script as the mapper that outputs the corrupted files, and either no reducer or a reducer that deletes or post-processes those files.
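For the local file system case, here is a minimal sketch of the same integrity check. Note that it swaps in Python's standard bz2 module in place of the bzip2 -t command, and that the function names is_valid_bz2 and find_corrupted are made up for illustration. It assumes the files end in .bz2 and sit directly in one folder.

```python
import bz2
from pathlib import Path


def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses cleanly, else False."""
    try:
        with bz2.open(path, "rb") as fh:
            # Read to the end so damage in later blocks is detected too.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # OSError: invalid data (e.g. bad magic number).
        # EOFError: stream ends before the end-of-stream marker.
        return False


def find_corrupted(folder):
    """Return the sorted paths of *.bz2 files in folder that fail the check."""
    return sorted(p for p in Path(folder).glob("*.bz2") if not is_valid_bz2(p))
```

Calling find_corrupted on the input folder returns a list of Path objects. It is probably safer to print and review that list first, and only then remove the files (for example with Path.unlink) before starting the MR job.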