It sounds like a simple job, but with MapReduce it doesn't seem that straightforward.
I have N files, each containing a single line of text. I'd like the Mapper to output key-value pairs of the form <filename, score>, where 'score' is an integer calculated from the line of text. As a side note, I am using the snippet below to get the file name (I hope it's correct).

    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String fileName = fileSplit.getPath().getName();

Assuming the mapper does its job correctly, it should output N key-value pairs. Now the problem: how should I program the Reducer to output only the one key-value pair with the maximum 'score'?
As far as I know, a Reducer only works on key-value pairs that share the same key. Since the mapper outputs in this scenario all have different keys, I am guessing something should be done before the Reduce step. Or should the Reduce step perhaps be omitted altogether?
Best answer
You can use the setup() and cleanup() methods (configure() and close() in the old API). Declare a field in the reducer class that holds the maximum score seen so far, together with its file name. On each call to reduce(), compare the input value (the score) with that field and update it if the new score is larger, without writing any output. Then emit the stored pair once, from cleanup().
setup() is called once before all reduce() invocations in a reduce task, and cleanup() is called once after the last reduce() invocation in that task. So if you have multiple reducers, setup() and cleanup() are called separately in each reduce task, and each task emits only its own maximum. To get a single overall maximum, run the job with one reduce task.
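To make the lifecycle concrete, here is a minimal sketch of the running-maximum pattern in plain Java. It deliberately does not use Hadoop classes, so it runs on its own: MaxScoreSketch, MaxScoreReducer and the output list are hypothetical stand-ins invented for illustration. In a real job the class would presumably extend org.apache.hadoop.mapreduce.Reducer and write the final pair through the Context passed to cleanup(), instead of appending to a list.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class MaxScoreSketch {

    /** Hypothetical stand-in for a reducer; method names mirror the Hadoop lifecycle hooks. */
    public static class MaxScoreReducer {
        private String bestFile;   // file name with the highest score seen so far
        private int bestScore;     // that file's score
        // Stand-in for the job output (assumption: a real reducer would use context.write).
        public final List<Map.Entry<String, Integer>> output = new ArrayList<>();

        /** Called once before any reduce() call in this task. */
        public void setup() {
            bestFile = null;
            bestScore = Integer.MIN_VALUE;
        }

        /** Called once per key; only updates the running maximum, emits nothing. */
        public void reduce(String fileName, Iterable<Integer> scores) {
            for (int score : scores) {
                if (bestFile == null || score > bestScore) {
                    bestFile = fileName;
                    bestScore = score;
                }
            }
        }

        /** Called once after the last reduce() call; emits the single winning pair. */
        public void cleanup() {
            if (bestFile != null) {
                output.add(new AbstractMap.SimpleImmutableEntry<>(bestFile, bestScore));
            }
        }
    }

    public static void main(String[] args) {
        MaxScoreReducer reducer = new MaxScoreReducer();
        reducer.setup();
        // Each file name is a distinct key with exactly one score, as in the question.
        reducer.reduce("a.txt", Arrays.asList(3));
        reducer.reduce("b.txt", Arrays.asList(9));
        reducer.reduce("c.txt", Arrays.asList(5));
        reducer.cleanup();
        for (Map.Entry<String, Integer> pair : reducer.output) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Because nothing is emitted until cleanup(), it does not matter that every key is different: the reducer simply sees each pair once and remembers the best one. On a tie, this sketch keeps the first file it saw.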