I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
Solution

I can think of achieving it in 3 different ways.
Using Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt

My gzipped file is Links.txt.gz and the output gets stored in /tmp/unzipped/Links.txt.
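The command above handles a single file. Since the question is about a whole folder of .gz files, a small shell loop can apply the same pipeline to every file. Below is a minimal sketch; the input folder /tmp/gzipped and output folder /tmp/unzipped are placeholder paths you would replace with your own.

#!/bin/bash
# Decompress every .gz file from one HDFS folder into another.
# /tmp/gzipped and /tmp/unzipped are example paths -- change them as needed.
IN_DIR=/tmp/gzipped
OUT_DIR=/tmp/unzipped

hadoop fs -mkdir -p "$OUT_DIR"

# List the .gz files in the input folder and process them one by one.
for f in $(hadoop fs -ls "$IN_DIR/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
    name=$(basename "$f" .gz)                      # Links.txt.gz -> Links.txt
    hadoop fs -cat "$f" | gzip -d | hadoop fs -put - "$OUT_DIR/$name"
done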
Using Java program
In the book Hadoop: The Definitive Guide, there is a section on codecs. In that section, there is a program that decompresses a file using CompressionCodecFactory. I am reproducing that code here:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);

        // Pick the codec based on the file extension (.gz -> GzipCodec).
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        // Output path is the input path with the compression suffix stripped.
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

This code takes the gz file path as input. You can execute it as:
FileDecompressor <gzipped file name>

For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz

I got the unzipped file at the location /tmp/Links.txt.
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
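As a rough sketch of such a wrapper, assume the class above is packaged into a jar named hadooptests.jar (a placeholder name) and still uses the one-argument form shown, which writes the decompressed file next to the input. The loop below runs it for every .gz file and then moves the result into the target folder; all folder and jar names are placeholders.

#!/bin/bash
# Sketch of a shell wrapper around FileDecompressor.
# hadooptests.jar and the folder names are placeholders -- adjust to your setup.
IN_DIR=/tmp/gzipped
OUT_DIR=/tmp/unzipped

hadoop fs -mkdir -p "$OUT_DIR"

for f in $(hadoop fs -ls "$IN_DIR/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
    # The one-argument version writes the result next to the input, without the .gz suffix.
    hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$f"
    hadoop fs -mv "${f%.gz}" "$OUT_DIR/"
done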
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

When you run this script, the unzipped contents are first stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000

The part-m-00000 file contains the unzipped data.
Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
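As a sketch of that, suppose the LOAD/STORE part of the script above is saved as unzip.pig (a hypothetical file name) with the hard-coded paths replaced by $INPUT and $TMP_OUT parameters. A shell loop could then run it once per file and do the rename and cleanup with hadoop fs commands; folder names below are placeholders.

#!/bin/bash
# Sketch of a shell driver for the Pig approach.
# unzip.pig is a hypothetical parameterized version of the script above,
# containing only the LOAD and STORE statements with $INPUT and $TMP_OUT parameters.
IN_DIR=/tmp/gzipped
OUT_DIR=/tmp/unzipped

hadoop fs -mkdir -p "$OUT_DIR"

for f in $(hadoop fs -ls "$IN_DIR/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
    name=$(basename "$f" .gz)
    pig -param INPUT="$f" -param TMP_OUT=/tmp/tmp_unzipped unzip.pig
    # Move the single part file to its final name, then clean up the temp folder.
    hadoop fs -mv /tmp/tmp_unzipped/part-m-00000 "$OUT_DIR/$name"
    hadoop fs -rm -r /tmp/tmp_unzipped
done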