Sort by key when the value has multiple elements, using Scala

Updated: 2024-10-07 02:19:40
Problem description

I'm very new to Scala on Spark and am wondering how to create key-value pairs where the key has more than one element. For example, I have this dataset of baby names:

Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45

I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?

I did accomplish this using a loop (see below), but I'm wondering whether there is a shorter way that makes better use of Spark and Scala (i.e., can I decrease the computation time?).

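// Split each CSV line into trimmed fields: year, name, county, number
val names = sc.textFile("names.csv").map(l => l.split(",").map(_.trim))
val uniqueCounty = names.map(x => x(2)).distinct.collect

// One Spark job per county: filter, sum the counts per name, sort descending
for (i <- 0 to uniqueCounty.length - 1) {
  val county = uniqueCounty(i).toString
  val eachCounty = names
    .filter(x => x(2) == county)
    .map(l => (l(1), l(3).toInt)) // the number is the fourth field (index 3)
    .reduceByKey((a, b) => a + b)
    .sortBy(-_._2)
  println("County:" + county + eachCounty.first)
}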

Recommended answer

Here is a solution using RDDs. I am assuming you need the top occurring name per county.

val data = Array(
  (2000, "JOHN", "KINGS", 50),
  (2000, "BOB", "KINGS", 40),
  (2000, "MARY", "NASSAU", 60),
  (2001, "JOHN", "KINGS", 14),
  (2001, "JANE", "KINGS", 30),
  (2001, "BOB", "NASSAU", 45)
)
val rdd = sc.parallelize(data)

// Sum the counts for each (county, name) combination, used as a composite key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3, x._2), x._4)).reduceByKey(_ + _)

// Group the (name, total) pairs per county
val countyNameRdd = uniqNamePerCountyRdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()

// Sort by total in descending order and keep only the top name per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect

Output (the order of the counties may vary):

res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
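
To try the composite-key idea without a Spark cluster, the same two steps can be sketched with plain Scala collections. This is only an illustration under assumptions: the object name TopNamePerCounty and the helpers totals and topPerCounty are hypothetical and not part of the original answer, and the data is the six sample rows from the question. It keeps the name with the largest total per county by using maxBy instead of sorting the whole list.

```scala
object TopNamePerCounty {
  // One row of the sample data: (year, name, county, count).
  type Row = (Int, String, String, Int)

  val data: Seq[Row] = Seq(
    (2000, "JOHN", "KINGS", 50),
    (2000, "BOB", "KINGS", 40),
    (2000, "MARY", "NASSAU", 60),
    (2001, "JOHN", "KINGS", 14),
    (2001, "JANE", "KINGS", 30),
    (2001, "BOB", "NASSAU", 45)
  )

  // Step 1: sum the counts under the composite key (county, name), ignoring the year.
  def totals(rows: Seq[Row]): Map[(String, String), Int] =
    rows
      .groupBy { case (_, name, county, _) => (county, name) }
      .map { case (key, group) => (key, group.map(_._4).sum) }

  // Step 2: re-key by county alone and keep the (name, total) pair with the largest total.
  def topPerCounty(rows: Seq[Row]): Map[String, (String, Int)] =
    totals(rows).toSeq
      .map { case ((county, name), total) => (county, (name, total)) }
      .groupBy(_._1)
      .map { case (county, pairs) => (county, pairs.map(_._2).maxBy(_._2)) }

  def main(args: Array[String]): Unit =
    topPerCounty(data).toSeq.sortBy(_._1).foreach { case (county, (name, total)) =>
      println(s"$county: $name ($total)")
    }
}
```

On a pair RDD, step 2 could plausibly be expressed as reduceByKey((a, b) => if (a._2 >= b._2) a else b) on the (county, (name, total)) pairs, which would avoid groupByKey and the per-county sort; treat that as a suggestion to verify rather than as part of the original answer.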

Published: 2023-11-26 23:09:12
Article link: https://www.elefans.com/category/jswz/34/1635580.html