Word count: common words across files

Updated: 2024-10-20 03:26:26
Question

I have managed to run the Hadoop wordcount example in non-distributed mode. The output lands in a file named "part-00000", and I can see that it lists every word from all input files combined.

After tracing the wordcount code, I can see that it reads lines and splits them into words on whitespace.
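For reference, the splitting step can be reproduced outside Hadoop. This is a minimal sketch in plain Java, assuming the same default StringTokenizer used in the wordcount mapper; the class name LineTokens and the sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class LineTokens {
    // Splits one input line on StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed).
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation stays attached and case is preserved.
        System.out.println(tokens("Hello  hadoop,\thello world"));
    }
}
```

Note that "hadoop," and "Hello"/"hello" would count as different words here, so a real job may want to normalise tokens before emitting them.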

I am trying to think of a way to list only the words that occur in more than one file, along with their occurrences. Can this be achieved in Map/Reduce?

-Added- Are these changes appropriate?

    // changes in the parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

        // These are the original line; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        private String filename = fileSplit.getPath().getName();

        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // And here
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }

Recommended answer

You could amend the mapper to output the word as the key and a Text value holding the name of the file the word came from. Then, in your reducer, you just need to dedup the file names and output only those entries where the word appears in more than one file.
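The reducer logic can be sketched in plain Java, without any Hadoop types. This is only an illustration under assumptions: the class MultiFileWords, its method names, and the sample data are hypothetical, and the in-memory map stands in for the grouping that the shuffle phase would perform.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MultiFileWords {
    // Reduce step for a single word: collapse repeated file names into a set.
    static Set<String> distinctFiles(Iterable<String> fileNames) {
        Set<String> files = new TreeSet<>();
        for (String name : fileNames) {
            files.add(name);
        }
        return files;
    }

    // Keeps only the words whose deduplicated file set has more than one entry.
    static Map<String, Set<String>> wordsInMultipleFiles(Map<String, List<String>> grouped) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            Set<String> files = distinctFiles(entry.getValue());
            if (files.size() > 1) {
                result.put(entry.getKey(), files);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for what the reducer would see after the shuffle: word -> file names.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("hadoop", List.of("a.txt", "b.txt", "a.txt"));
        grouped.put("only", List.of("a.txt", "a.txt"));
        System.out.println(wordsInMultipleFiles(grouped)); // prints {hadoop=[a.txt, b.txt]}
    }
}
```

In a real job, the body of distinctFiles plus the size check would sit inside the reducer's reduce method, with the values iterator supplying the file names.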

How you get the name of the file being processed depends on whether you are using the old API (the mapred package) or the new API (the mapreduce package). With the new API, you can extract the mapper's input split from the Context object using the getInputSplit method, then cast the InputSplit to a FileSplit (assuming you are using TextInputFormat). I have never tried it with the old API, but apparently you can use a configuration property called map.input.file.

This would also be a good case for introducing a Combiner, to dedup multiple occurrences of the same word coming out of the same mapper.
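The combiner idea can be sketched the same way, in plain Java rather than against the Hadoop API. The class LocalDedup and the sample pairs are hypothetical; the list stands in for the (word, file name) pairs emitted by a single map task.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LocalDedup {
    // Combine step: drop repeated (word, fileName) pairs produced by one map task,
    // keeping the first occurrence of each pair in emission order.
    static List<Map.Entry<String, String>> combine(List<Map.Entry<String, String>> mapOutput) {
        Set<Map.Entry<String, String>> seen = new LinkedHashSet<>(mapOutput);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> mapOutput = List.of(
            Map.entry("hadoop", "a.txt"),
            Map.entry("hadoop", "a.txt"),
            Map.entry("word", "a.txt"));
        System.out.println(combine(mapOutput));
    }
}
```

One design point worth keeping in mind: the combiner should only dedup. It sees the output of one map task at a time, so the "more than one file" filter has to stay in the reducer, which sees the file names from all tasks.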

Update

In response to your problem: you are trying to use an instance variable called reporter, which does not exist in the class scope of the mapper. Amend as follows:

    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

        // These are the original line; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        private String filename = null;

        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            if (filename == null) {
                filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            }

            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // And here
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }


Published: 2023-11-11 07:13:46
Link: https://www.elefans.com/category/jswz/34/1577739.html