Hadoop: searching for words from one file in another file

Problem description

    I want to build a hadoop application which can read words from one file and search in another file.

    If the word exists, it has to be written to one output file; if the word doesn't exist, it has to be written to another output file.

    I tried a few examples in Hadoop. I have two questions:

    The two files are approximately 200MB each. Checking every word of one file against the other might cause an out-of-memory error. Is there an alternative way of doing this?

    How can data be written to different files, given that the output of Hadoop's reduce phase goes to only one file? Is it possible to have a filter in the reduce phase that writes data to different output files?

    Thank you.

    Solution

    How I would do it:

  • split value in 'map' by words, emit (<word>, <source>) (*1)
  • you'll get in 'reduce': (<word>, <list of sources>)
  • check source-list (might be long for both/all sources)
  • if NOT all sources are in the list, emit (<missingsource>, <word>) once for each missing source
  • job2: job.setNumReduceTasks(<numberofsources>)
  • job2: emit in 'map' (<missingsource>, <word>)
  • job2: emit for each <missingsource> in 'reduce' all (null, <word>)

    You'll end up with as many reduce outputs as there are different <missingsource>s, each containing the missing words for that document. You could write out the <missingsource> once at the beginning of 'reduce' to mark each file.
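The reduce step of job 1 can be sketched in plain Java, outside the Hadoop API, to show the core logic (the class and method names below are illustrative, not from the original answer):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Simulates job 1's reduce step for one word: given the list of source
// files the word appeared in, emit (missingSource, word) for every known
// source that is NOT in that list. If all sources are present, nothing
// is emitted.
public class MissingWordReduce {
    public static List<String[]> reduce(String word,
                                        Collection<String> sourcesSeen,
                                        Set<String> allSources) {
        List<String[]> emitted = new ArrayList<>();
        for (String source : allSources) {
            if (!sourcesSeen.contains(source)) {
                // key = the source the word is missing from, value = the word
                emitted.add(new String[] { source, word });
            }
        }
        return emitted;
    }
}
```

In the real job each pair would go through context.write(...); job 2 then groups the pairs by <missingsource>, so each reducer collects the missing words of one source file.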

    (*1) How to find out the source in 'map' (Hadoop 0.20):

    // imports needed by this snippet (not shown in the original):
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    private String localname;            // path of the input file this map task reads
    private Text outkey = new Text();
    private Text outvalue = new Text();
    ...
    public void setup(Context context) throws InterruptedException, IOException {
        super.setup(context);
        // The input split knows which file it came from.
        localname = ((FileSplit) context.getInputSplit()).getPath().toString();
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        ...
        outkey.set(...);                 // the word
        outvalue.set(localname);         // its source file
        context.write(outkey, outvalue);
    }
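On the asker's second question (writing hits and misses to different files): besides the setNumReduceTasks trick above, Hadoop also ships a MultipleOutputs helper (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) that lets a single reducer write to several named outputs. Whichever mechanism is used, the reducer only needs a routing rule; a plain-Java sketch with illustrative names:

```java
import java.util.Set;

public class OutputRouter {
    // Routes a word to an output name: "exists" if the word was seen in
    // every expected source file, "missing" otherwise. For the asker's
    // two-file case totalSources is 2.
    public static String route(Set<String> sourcesSeen, int totalSources) {
        return sourcesSeen.size() == totalSources ? "exists" : "missing";
    }
}
```

With MultipleOutputs, the returned name would be passed to its write(name, key, value) call inside 'reduce', after registering each name in the driver with MultipleOutputs.addNamedOutput(...).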
