MapReduce Classic Example: Grandparent-Grandchild Relationships (Self-Join Implemented with Both MapJoin and ReduceJoin)
Requirement:
Find every pair of people who are in a grandparent-grandchild relationship.
Input data:
Son Father
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Terry
Philip Alma
Mark Terry
Mark Alma
reduce-join output:
Grandson Father Grandfather
Tom Jack Alice
Tom Jack Jesse
Jone Jack Alice
Jone Jack Jesse
Tom Lucy Ben
Tom Lucy Mary
Jone Lucy Ben
Jone Lucy Mary
Philip Terry Alice
Philip Terry Jesse
Mark Terry Alice
Mark Terry Jesse
map-join output:
Tom Lucy Ben
Tom Jack Jesse
Jone Lucy Ben
Jone Jack Jesse
Lucy Mary null
Lucy Ben null
Jack Alice null
Jack Jesse null
Terry Alice null
Terry Jesse null
Philip Terry Jesse
Philip Alma null
Mark Terry Jesse
Mark Alma null
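The map-join rows above are consistent with a map-side join that caches the whole table in a hash map keyed by son, so that a later row for the same son overwrites an earlier one, and a lookup that finds nothing prints null. That reading is an assumption inferred from the output, not something the listing confirms. A minimal plain-Java sketch of it, with no Hadoop dependency and with hypothetical names (MapJoinSketch, sampleRecords, join):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapJoinSketch {
    // The input table from this article, header row omitted.
    static List<String[]> sampleRecords() {
        return List.of(
            new String[]{"Tom", "Lucy"}, new String[]{"Tom", "Jack"},
            new String[]{"Jone", "Lucy"}, new String[]{"Jone", "Jack"},
            new String[]{"Lucy", "Mary"}, new String[]{"Lucy", "Ben"},
            new String[]{"Jack", "Alice"}, new String[]{"Jack", "Jesse"},
            new String[]{"Terry", "Alice"}, new String[]{"Terry", "Jesse"},
            new String[]{"Philip", "Terry"}, new String[]{"Philip", "Alma"},
            new String[]{"Mark", "Terry"}, new String[]{"Mark", "Alma"});
    }

    static List<String> join(List<String[]> records) {
        // Assumed cache: son -> father. put() overwrites, so only the
        // last father listed for each son is kept.
        Map<String, String> cache = new HashMap<>();
        for (String[] r : records) {
            cache.put(r[0], r[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] r : records) {
            // Look up the father's own father; a miss yields null,
            // which string concatenation renders as "null".
            out.add(r[0] + " " + r[1] + " " + cache.get(r[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        join(sampleRecords()).forEach(System.out::println);
    }
}
```

Under these assumptions each son keeps only one cached father, which would explain why the map-join result has one grandfather per row and null rows, while the reduce-join result lists every grandson and grandfather combination.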
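reduce-join analysis:
Finding a three-generation relationship is essentially a self-join of a single table, and the join key is the father. The mapper therefore emits every record twice, which yields the two logical tables. One of the two copies must be flipped so that both copies share the same map key (the person in the middle generation); only then does the self-join work. Each value sent to the reducer must also carry a tag identifying which copy it came from, otherwise the reducer cannot tell grandsons from grandfathers.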
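The emit-twice, flip, and tag steps can be sketched in plain Java, with no Hadoop dependency, by simulating the shuffle with a map from key to value list. This is only an illustration of the idea. The class name, the method names, and the "C:" and "P:" tags are hypothetical and are not taken from the SonDriver code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReduceJoinSketch {
    // Illustrative tags marking which copy of the table a value came from.
    static final String CHILD_TAG = "C:";
    static final String PARENT_TAG = "P:";

    // The input table from this article, header row omitted.
    static List<String[]> sampleRecords() {
        return List.of(
            new String[]{"Tom", "Lucy"}, new String[]{"Tom", "Jack"},
            new String[]{"Jone", "Lucy"}, new String[]{"Jone", "Jack"},
            new String[]{"Lucy", "Mary"}, new String[]{"Lucy", "Ben"},
            new String[]{"Jack", "Alice"}, new String[]{"Jack", "Jesse"},
            new String[]{"Terry", "Alice"}, new String[]{"Terry", "Jesse"},
            new String[]{"Philip", "Terry"}, new String[]{"Philip", "Alma"},
            new String[]{"Mark", "Terry"}, new String[]{"Mark", "Alma"});
    }

    // Simulated map phase plus shuffle: every record is emitted twice.
    static Map<String, List<String>> mapAndShuffle(List<String[]> records) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] r : records) {
            String son = r[0];
            String father = r[1];
            // Copy 1, keyed by father: the son is a child of the key.
            grouped.computeIfAbsent(father, k -> new ArrayList<>()).add(CHILD_TAG + son);
            // Copy 2 (flipped), keyed by son: the father is a parent of the key.
            grouped.computeIfAbsent(son, k -> new ArrayList<>()).add(PARENT_TAG + father);
        }
        return grouped;
    }

    // Simulated reduce phase: split values by tag, then pair them up.
    static List<String> reduce(Map<String, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            List<String> children = new ArrayList<>();
            List<String> parents = new ArrayList<>();
            for (String v : e.getValue()) {
                if (v.startsWith(CHILD_TAG)) {
                    children.add(v.substring(CHILD_TAG.length()));
                } else {
                    parents.add(v.substring(PARENT_TAG.length()));
                }
            }
            // Every child of the key paired with every parent of the key
            // is one grandson, father, grandfather row.
            for (String c : children) {
                for (String p : parents) {
                    out.add(c + " " + e.getKey() + " " + p);
                }
            }
        }
        return out;
    }

    static List<String> join(List<String[]> records) {
        return reduce(mapAndShuffle(records));
    }

    public static void main(String[] args) {
        join(sampleRecords()).forEach(System.out::println);
    }
}
```

Keys that have values from only one copy, such as Tom (parents only) or Alma (children only), produce no rows, so only Lucy, Jack, and Terry contribute to the result. The row order depends on grouping order and need not match the order a real job would write.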
reduce-join code implementation:
package mr.day04;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.LinkedList;

/**
 * @ClassName: SonDriver
 * @Description:
 * @Author: xuezhouyi
 * @Version: V1.0
 **/
public class SonDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configurat