Hadoop: searching for words from one file in another file

Problem description

    I want to build a hadoop application which can read words from one file and search in another file.

    If the word exists, it has to be written to one output file; if the word doesn't exist, it has to be written to another output file.

    I tried a few examples in Hadoop. I have two questions:

    The two files are approximately 200MB each. Checking every word of one file against the other might cause an out-of-memory error. Is there an alternative way of doing this?

    How can data be written to different files, given that the output of Hadoop's reduce phase goes to only one file? Is it possible to have a filter in the reduce phase that writes data to different output files?

    Thank you.

    Solution

    How I would do it:

  • split value in 'map' by words, emit (<word>, <source>) (*1)
  • you'll get in 'reduce': (<word>, <list of sources>)
  • check source-list (might be long for both/all sources)
  • if NOT all sources are in the list, emit (<missingsource>, <word>) once for each missing source
  • job2: job.setNumReduceTasks(<numberofsources>)
  • job2: emit in 'map' (<missingsource>, <word>)
  • job2: emit for each <missingsource> in 'reduce' all (null, <word>)

    You'll end up with as many reduce outputs as there are different <missingsource>s, each containing the missing words for that document. You could write out the <missingsource> once at the beginning of 'reduce' to mark each file.
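The reduce step of job 1 can be sketched in plain Java, outside the Hadoop API, to show the core logic (the class and method names below are illustrative, not from the original answer):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Simulates job 1's reduce step for one word: given the list of source
// files the word appeared in, emit (missingSource, word) for every known
// source that is NOT in that list. If all sources are present, nothing
// is emitted.
public class MissingWordReduce {
    public static List<String[]> reduce(String word,
                                        Collection<String> sourcesSeen,
                                        Set<String> allSources) {
        List<String[]> emitted = new ArrayList<>();
        for (String source : allSources) {
            if (!sourcesSeen.contains(source)) {
                // key = the source the word is missing from, value = the word
                emitted.add(new String[] { source, word });
            }
        }
        return emitted;
    }
}
```

In the real job each pair would go through context.write(...); job 2 then groups the pairs by <missingsource>, so each reducer collects the missing words of one source file.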

    (*1) How to find out the source in 'map' (Hadoop 0.20):

    // imports needed by this snippet (not shown in the original):
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    private String localname;            // path of the input file this map task reads
    private Text outkey = new Text();
    private Text outvalue = new Text();
    ...
    public void setup(Context context) throws InterruptedException, IOException {
        super.setup(context);
        // The input split knows which file it came from.
        localname = ((FileSplit) context.getInputSplit()).getPath().toString();
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        ...
        outkey.set(...);                 // the word
        outvalue.set(localname);         // its source file
        context.write(outkey, outvalue);
    }
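On the asker's second question (writing hits and misses to different files): besides the setNumReduceTasks trick above, Hadoop also ships a MultipleOutputs helper (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) that lets a single reducer write to several named outputs. Whichever mechanism is used, the reducer only needs a routing rule; a plain-Java sketch with illustrative names:

```java
import java.util.Set;

public class OutputRouter {
    // Routes a word to an output name: "exists" if the word was seen in
    // every expected source file, "missing" otherwise. For the asker's
    // two-file case totalSources is 2.
    public static String route(Set<String> sourcesSeen, int totalSources) {
        return sourcesSeen.size() == totalSources ? "exists" : "missing";
    }
}
```

With MultipleOutputs, the returned name would be passed to its write(name, key, value) call inside 'reduce', after registering each name in the driver with MultipleOutputs.addNamedOutput(...).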
