I have a JavaRDD. When I print it, my data looks like this: [[String1,String2,String3],[String4],[String5,String6],[String7,String8,String9]]
Each string is itself a pipe-separated string, and I can split each one to form a key and a value.
How can I convert this RDD to a JavaPairRDD?
Solution
Assuming you have data like this in a JavaRDD<List<String>>:
List_0: ["sub10~sub11~sub12", "sub20~sub21~sub22", "sub30~sub31~sub32"]
List_1: ["sub40~sub41~sub42"]
where ~ is the separator.
Suppose you want to flatten the lists and, for each input string, build a key by joining its first and third substrings with |, then store the resulting pairs in a JavaPairRDD<String,String>:
key: "sub10|sub12"
value: "sub10~sub11~sub12"
You can achieve this with flatMap followed by mapToPair:
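rdd.flatMap(new FlatMapFunction<List<String>, String>() {
    // Flatten each inner list into its individual strings.
    public Iterable<String> call(List<String> li) throws Exception {
        return li;
    }
}).mapToPair(new PairFunction<String, String, String>() {
    // Key: first and third fields joined by "|"; value: the original string.
    public Tuple2<String, String> call(String s) throws Exception {
        String[] ss = s.split("~");
        return new Tuple2<String, String>(ss[0] + "|" + ss[2], s);
    }
});
Note that this uses the Spark 1.x signature, where FlatMapFunction.call returns an Iterable. From Spark 2.0 onward it returns an Iterator, so the first function would be declared as returning Iterator<String> and would return li.iterator(). Also, if your strings are separated by | rather than ~, split on "\\|", because | is a regex metacharacter in String.split.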
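To check the flatten-and-key logic without a Spark cluster, here is a minimal sketch that uses plain Java streams instead of Spark. The class name FlattenToPairs and the helpers toKey and flattenToPairs are hypothetical names introduced for illustration. The length check is an added assumption about how to treat strings with fewer than three fields, which would otherwise fail on ss[2].

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-in for the Spark pipeline: flatten, then build (key, value) pairs.
public class FlattenToPairs {

    // Builds "first|third" from a "~"-separated record.
    // Assumption: records with fewer than three fields are rejected.
    static String toKey(String record) {
        String[] parts = record.split("~");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 fields: " + record);
        }
        return parts[0] + "|" + parts[2];
    }

    // Flattens the nested lists and pairs each string with its key.
    static List<SimpleEntry<String, String>> flattenToPairs(List<List<String>> nested) {
        return nested.stream()
                .flatMap(List::stream) // plays the role of rdd.flatMap
                .map(s -> new SimpleEntry<String, String>(toKey(s), s)) // plays the role of mapToPair
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
                Arrays.asList("sub10~sub11~sub12", "sub20~sub21~sub22", "sub30~sub31~sub32"),
                Arrays.asList("sub40~sub41~sub42"));
        for (SimpleEntry<String, String> e : flattenToPairs(data)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The same toKey helper could be called from inside the PairFunction in the Spark version, so that malformed records fail with a descriptive message rather than an ArrayIndexOutOfBoundsException.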