Weighted Average in Spark

Problem Description


I have two RDDs. The first, which I'll call userVisits, looks like this:

((123, someurl,Mon Nov 04 00:00:00 PST 2013),11.0)


and the second is allVisits:

((someurl,Mon Nov 04 00:00:00 PST 2013),1122.0)


I can do userVisits.reduceByKey(_+_) to get the number of visits by that user, and I can do the same with allVisits. What I want to do is get a weighted average for the users by dividing each user's visits by the total visits for the day. I need to look up a value in allVisits using part of the key tuple from userVisits. I'm guessing it could be done with a map like this:

userVisits.reduceByKey(_+_).map( item => item._2 / allVisits.get(item._1))


I know allVisits.get(key) doesn't exist, but how could I accomplish something like that?


The alternative is to get the keys from allVisits, map the corresponding counts from userVisits, and then join the two, but that seems inefficient.

Answer


The only universal option I see here is join:

val userVisitsAgg = userVisits.reduceByKey(_ + _)
val allVisitsAgg = allVisits.reduceByKey(_ + _)

userVisitsAgg
  .map { case ((id, url, date), sum) => ((url, date), (id, sum)) }
  .join(allVisitsAgg)
  .map { case ((url, date), ((id, userSum), urlSum)) =>
    ((id, url, date), userSum / urlSum)
  }
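
A minimal end-to-end sketch of this approach (my addition, assuming a live SparkContext named sc; the sample rows mirror the question and the numbers are illustrative only):

// Build toy RDDs shaped like the question's data.
val userVisits = sc.parallelize(Seq(
  ((123, "someurl", "Mon Nov 04 00:00:00 PST 2013"), 11.0)))
val allVisits = sc.parallelize(Seq(
  (("someurl", "Mon Nov 04 00:00:00 PST 2013"), 1122.0)))

// Aggregate, re-key on (url, date), join against the daily totals, and divide.
val weighted = userVisits.reduceByKey(_ + _)
  .map { case ((id, url, date), sum) => ((url, date), (id, sum)) }
  .join(allVisits.reduceByKey(_ + _))
  .map { case ((url, date), ((id, userSum), urlSum)) => ((id, url, date), userSum / urlSum) }

weighted.collect().foreach(println)
// prints roughly: ((123,someurl,Mon Nov 04 00:00:00 PST 2013),0.0098...)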


If allVisitsAgg is small enough to be broadcast, you can simplify the above to something like this:

val allVisitsAggBD = sc.broadcast(allVisitsAgg.collectAsMap)

userVisitsAgg.map { case ((id, url, date), sum) =>
  ((id, url), sum / allVisitsAggBD.value((url, date)))
}
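
One caveat worth noting (my addition, not part of the original answer): allVisitsAggBD.value((url, date)) throws a NoSuchElementException if a (url, date) pair from userVisitsAgg has no entry in the broadcast map. A hedged variant that simply drops such records:

// Sketch: skip (url, date) keys missing from the broadcast map instead of failing.
userVisitsAgg.flatMap { case ((id, url, date), sum) =>
  allVisitsAggBD.value.get((url, date)).map { urlSum =>
    ((id, url), sum / urlSum)
  }
}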
