I have an RDD partitioned across the cluster and I want to do reduceByKey on each partition separately. I don't want the results of reduceByKey on the partitions to be merged together; I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.
The code below does not work, but I want something like this:
```scala
myPairedRDD.mapPartitions({ iter => iter.reduceByKey((x, y) => x + y) })
```

How can I achieve this?
Answer: You can try something like:
```scala
myPairedRDD.mapPartitions(iter =>
  // Iterator has no groupBy, so materialize the partition first
  iter.toSeq
    .groupBy(_._1)
    .mapValues(_.map(_._2).reduce(_ + _))
    .iterator
)
```

or, to keep things more memory efficient (here I assume that myPairedRDD is RDD[(String, Double)]; please adjust the types to match your use case):
```scala
import scala.collection.mutable

myPairedRDD.mapPartitions(iter =>
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => acc(k) += v; acc
  }.iterator
)
```

Please note, however, that unlike shuffling operations, this cannot offload data from memory.
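The per-partition aggregation above can be tried out without a cluster, since the `mapPartitions` closure is just a function from one partition's iterator to another. A minimal sketch simulating a single partition (the sample pairs and the `PartitionFoldDemo` name are made up for illustration):

```scala
import scala.collection.mutable

object PartitionFoldDemo {
  // Simulates what the mapPartitions closure does to one partition's iterator:
  // fold the (key, value) pairs into a mutable map, summing values per key.
  def reducePartition(iter: Iterator[(String, Double)]): Iterator[(String, Double)] =
    iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
      case (acc, (k, v)) => acc(k) += v; acc
    }.iterator

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for one partition of myPairedRDD
    val partition = Iterator(("a", 1.0), ("b", 2.0), ("a", 3.0))
    println(reducePartition(partition).toMap) // per-key sums within this partition only
  }
}
```

Because the fold runs independently per partition, the same key appearing in two different partitions produces two separate output pairs, which is exactly the "no merging across partitions" behavior asked for.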