How do I decompress .gz files into a new directory in Hadoop?

Updated: 2024-10-19 17:22:37

Question

I have a bunch of .gz files in a folder in HDFS. I want to decompress all of them into a new folder in HDFS. How should I do this?

Solution

I can think of three different ways to achieve this.

  • Using Linux command line

    The following command worked for me:

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt

    My gzipped file is Links.txt.gz. The output is stored in /tmp/unzipped/Links.txt.
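Since the question is about a whole folder of .gz files, the same pipeline can be wrapped in a loop. The following is only a sketch under a few assumptions: the function names and example paths are made up for illustration, and it relies on `hadoop fs -ls -C` (which prints bare paths) being available in your Hadoop version. It also creates the target folder first with `hadoop fs -mkdir -p`.

```shell
#!/usr/bin/env bash
# Sketch: decompress every .gz file in one HDFS folder into another HDFS folder.
# Function names and example paths are hypothetical.

# Strip the directory and the trailing .gz: /tmp/gz/Links.txt.gz -> Links.txt
output_name() {
  basename "$1" .gz
}

decompress_folder() {
  local src_dir=$1 dst_dir=$2 path
  # Create the target folder; -p makes this a no-op if it already exists.
  hadoop fs -mkdir -p "$dst_dir"
  # Assumption: this Hadoop version supports "ls -C" (paths only, one per line).
  hadoop fs -ls -C "$src_dir" | grep '\.gz$' | while IFS= read -r path; do
    # </dev/null keeps the inner command from consuming the loop's input.
    hadoop fs -cat "$path" </dev/null | gzip -d |
      hadoop fs -put - "$dst_dir/$(output_name "$path")"
  done
}

# Example call (placeholder paths):
# decompress_folder /tmp/gz /tmp/unzipped
```

Files that do not end in .gz are skipped by the `grep` filter, and each file is streamed through the local machine, so this is best suited to a modest number of files.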

  • Using Java program

    The book Hadoop: The Definitive Guide has a section on codecs. That section includes a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:

    package com.myorg.hadooptests;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    public class FileDecompressor {

        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            // Pick the codec from the file extension (.gz maps to the gzip codec).
            Path inputPath = new Path(uri);
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }

            // The output path is the input path with the codec's extension removed.
            String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }

    This code takes the .gz file path as input. You can run it as:

    FileDecompressor <gzipped file name>

    For example, when I ran it on my gzipped file:

    FileDecompressor /tmp/Links.txt.gz

    I got the decompressed file at /tmp/Links.txt.

    The program stores the decompressed file in the same folder as the input. So you need to modify this code to take two input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call it for each of your inputs.
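A wrapper script for this approach might look like the sketch below. Instead of modifying the Java code, it runs the unmodified FileDecompressor on each input and then moves the result into the output folder with `hadoop fs -mv`. The jar name `hadooptests.jar` and the function name are assumptions for illustration; substitute whatever jar you package the class into.

```shell
#!/usr/bin/env bash
# Sketch: call FileDecompressor once per input, then move each result
# into the output folder. JAR is a hypothetical jar containing the class.
JAR=hadooptests.jar
MAIN_CLASS=com.myorg.hadooptests.FileDecompressor

decompress_with_java() {
  local dst_dir=$1 gz_path
  shift
  hadoop fs -mkdir -p "$dst_dir"
  for gz_path in "$@"; do
    # The program writes its output next to the input, without the .gz suffix.
    hadoop jar "$JAR" "$MAIN_CLASS" "$gz_path"
    # Move that output file into the target folder.
    hadoop fs -mv "${gz_path%.gz}" "$dst_dir/"
  done
}

# Example call (placeholder paths):
# decompress_with_java /tmp/unzipped /tmp/Links.txt.gz /tmp/Other.txt.gz
```

Each call starts a new JVM, so for many files it may be faster to extend the Java program itself to loop over its arguments.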

  • Using Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/

    When you run this script, the decompressed contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000

    The file part-m-00000 contains the decompressed data.

    Hence, we need to rename it explicitly with the following commands and finally delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/

    So, if you use this Pig script, you only need to parameterize the file names (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of your inputs.
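One way to do that parameterization is Pig's parameter substitution (`-param name=value` on the command line, `$name` in the script). The sketch below writes a parameterized copy of the script and calls it once per input. The script file name, the parameter names (`input`, `tmp`, `output`), and the per-file temporary folder naming are all assumptions for illustration, and the output folder is assumed to exist already.

```shell
#!/usr/bin/env bash
# Sketch: parameterize the Pig script and run it once per input file.
# The script name, parameter names and temp-folder naming are hypothetical.

write_pig_script() {
  # The quoted 'EOF' keeps $input, $tmp and $output literal so that Pig,
  # not the shell, substitutes them.
  cat > "$1" <<'EOF'
A = LOAD '$input' USING PigStorage();
STORE A INTO '$tmp' USING PigStorage();
mv $tmp/part-m-00000 $output
rm $tmp
EOF
}

unzip_with_pig() {
  local script=$1 dst_dir=$2 gz_path name
  shift 2
  for gz_path in "$@"; do
    name=$(basename "$gz_path" .gz)
    # A separate temporary folder per file avoids "output already exists" errors.
    pig -param input="$gz_path" \
        -param tmp="/tmp/tmp_unzipped_$name" \
        -param output="$dst_dir/$name" \
        -f "$script"
  done
}

# Example call (placeholder paths):
# write_pig_script unzip.pig
# unzip_with_pig unzip.pig /tmp/unzipped /tmp/Links.txt.gz
```

Each input launches its own Pig job, so this is the slowest of the three approaches for small files, but it keeps the decompression work on the cluster.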

Published: 2023-11-05 08:08:27