I want to build a hadoop application which can read words from one file and search in another file.
If the word exists, it has to be written to one output file; if the word doesn't exist, it has to be written to another output file.
I tried a few examples in Hadoop. I have two questions:
The two files are approximately 200MB each. Checking every word against the other file might cause an out-of-memory error. Is there an alternative way to do this?
How do I write data to different files, given that the output of Hadoop's reduce phase is written to only one file? Is it possible to filter the reduce phase's output into different files?
Thank you.
Solution: How I would do it:
You'll end up with as many reduce outputs as there are different <missingsource> values, each containing the missing words for that document. You could write out the <missingsource> once, at the beginning of reduce(), to mark the files.
(*1) How to find out the source in map (Hadoop 0.20):
private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...

public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);
    // Record which input file this map task is reading from
    localname = ((FileSplit) context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    ...
    outkey.set(...);
    outvalue.set(localname);
    context.write(outkey, outvalue);
}
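The reduce-side logic implied above — grouping each word with the set of source files that emitted it, then listing the word under every source it is missing from — can be sketched without Hadoop dependencies. This is a minimal illustration only; the `MissingWords` class, the `missingWords` method, and the sample file names are my own naming, not part of the original answer:

```java
import java.util.*;

public class MissingWords {
    // Given word -> set of sources containing it (what a reducer sees after
    // the shuffle groups the mapper's (word, filename) pairs), return
    // source -> list of words missing from that source.
    static Map<String, List<String>> missingWords(
            Map<String, Set<String>> wordSources, List<String> allSources) {
        Map<String, List<String>> missing = new LinkedHashMap<>();
        for (String src : allSources) {
            missing.put(src, new ArrayList<>());
        }
        for (Map.Entry<String, Set<String>> e : wordSources.entrySet()) {
            for (String src : allSources) {
                if (!e.getValue().contains(src)) {
                    missing.get(src).add(e.getKey());
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> wordSources = new LinkedHashMap<>();
        wordSources.put("apple", new HashSet<>(Arrays.asList("a.txt", "b.txt")));
        wordSources.put("pear", new HashSet<>(Collections.singletonList("a.txt")));
        // "pear" appears only in a.txt, so it is missing from b.txt
        System.out.println(missingWords(wordSources, Arrays.asList("a.txt", "b.txt")));
        // prints {a.txt=[], b.txt=[pear]}
    }
}
```

Because the shuffle delivers each word to exactly one reducer together with all the filenames that emitted it, no file ever has to be held in memory, which addresses the 200MB concern in the question.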