Using reduceByKey to group a list of values

Updated: 2024-10-26 04:26:33

Problem description

I want to group the values for each key into a list, and was doing something like this:

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .groupByKey()
  .collect
  .foreach(println)

(red,CompactBuffer(zero, two))
(yellow,CompactBuffer(one))

But I noticed a Databricks blog post that recommends against using groupByKey on large datasets.

Avoid GroupByKey

Is there a way to achieve the same result using reduceByKey?

I tried this, but it concatenates all the values. By the way, in my case both the key and the value are strings.

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .reduceByKey(_ ++ _)
  .collect
  .foreach(println)

(red,zerotwo)
(yellow,one)

Recommended answer

Use aggregateByKey:

import scala.collection.mutable.ListBuffer

sc.parallelize(Array(("red", "zero"), ("yellow", "one"), ("red", "two")))
  .aggregateByKey(ListBuffer.empty[String])(
    // add one value to the buffer for its key, within a partition
    (numList, num) => { numList += num; numList },
    // merge the buffers for the same key from different partitions
    (numList1, numList2) => { numList1.appendAll(numList2); numList1 })
  .mapValues(_.toList)
  .collect()

Array[(String, List[String])] = Array((yellow,List(one)), (red,List(zero, two)))

See this answer for details on aggregateByKey, and this link for the rationale behind using the mutable collection ListBuffer.
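To make the two functions passed to aggregateByKey more concrete, here is a minimal sketch in plain Scala that imitates them without Spark. Everything in it is an assumption made for illustration: the nested Seq named `partitions` stands in for an RDD split into two partitions, and `seqOp`, `combOp` and `aggregate` are hypothetical helper names, not Spark API.

```scala
import scala.collection.mutable.ListBuffer

object AggregateSketch {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  val partitions: Seq[Seq[(String, String)]] = Seq(
    Seq(("red", "zero"), ("yellow", "one")),
    Seq(("red", "two"))
  )

  // First function: fold one value into the buffer for its key (runs inside a partition).
  def seqOp(buf: ListBuffer[String], v: String): ListBuffer[String] = { buf += v; buf }

  // Second function: merge two buffers for the same key (runs across partitions).
  def combOp(a: ListBuffer[String], b: ListBuffer[String]): ListBuffer[String] = { a ++= b; a }

  def aggregate(parts: Seq[Seq[(String, String)]]): Map[String, List[String]] = {
    // Step 1: within each partition, build one buffer per key.
    val perPartition: Seq[Map[String, ListBuffer[String]]] = parts.map { part =>
      part.foldLeft(Map.empty[String, ListBuffer[String]]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, ListBuffer.empty[String]), v))
      }
    }
    // Step 2: merge the per-partition buffers key by key, then freeze them into Lists.
    perPartition.flatten
      .foldLeft(Map.empty[String, ListBuffer[String]]) { case (acc, (k, buf)) =>
        acc.updated(k, acc.get(k).map(combOp(_, buf)).getOrElse(buf))
      }
      .map { case (k, buf) => k -> buf.toList }
  }

  def main(args: Array[String]): Unit =
    aggregate(partitions).toSeq.sortBy(_._1).foreach(println)
}
```

In this local sketch the values for "red" come out in partition order; on a real cluster the order in which partitions are merged should not be relied on.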

Is there a way to achieve the same result using reduceByKey?

The aggregateByKey approach above actually performs worse; see the comments by @zero323 for the details.
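As for the literal question: the concatenation in the reduceByKey attempt comes from applying `++` to two Strings. Wrapping each value in a single-element List first makes the reduce function append lists instead. The sketch below shows this in plain Scala so that it runs without a cluster; `reduceByKeyLocal` is a hypothetical stand-in for Spark's reduceByKey, not a real API. In Spark the same shape would be `mapValues(List(_))` followed by `reduceByKey(_ ::: _)`.

```scala
object ReduceSketch {
  // Hypothetical local stand-in for reduceByKey: combines the values of each key with f.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  val data = Seq(("red", "zero"), ("yellow", "one"), ("red", "two"))

  // Reducing the raw Strings concatenates them, as in the question.
  val concatenated: Map[String, String] = reduceByKeyLocal(data)(_ ++ _)

  // Wrapping each value in a List first makes the reduce step append lists.
  val grouped: Map[String, List[String]] =
    reduceByKeyLocal(data.map { case (k, v) => (k, List(v)) })(_ ::: _)

  def main(args: Array[String]): Unit = {
    println(concatenated("red")) // zerotwo
    println(grouped("red"))      // List(zero, two)
  }
}
```

This yields the grouped lists, but it should not be expected to move less data than groupByKey: every value still ends up in the result, so the reduce step has nothing to shrink before the shuffle.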

Published: 2023-10-24 08:29:34
Article link: https://www.elefans.com/category/jswz/34/1523448.html