This article describes how to join files in Apache Spark; it may serve as a useful reference for anyone facing the same problem.
Problem description
I have a file like this. code_count.csv
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
Another file like this. details.csv
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code but in the final output, I want the exp_code. Like this
Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4

Here is my code:
var countData = sc.textFile("C:\\path\\to\\code_count.csv")
var countDataKV = countData.map(x => x.split(",")).map(x => (x(0), x(1).toInt))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)

This gives:
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.
Recommended answer
You need to calculate the sum after grouping by code and then join with the other dataframe. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df1 = spark.sparkContext.parallelize(Seq(
    ("AE", 2, 2008), ("AE", 3, 2008), ("BX", 1, 2005),
    ("CD", 4, 2004), ("HU", 1, 2003), ("BX", 8, 2004)))
  .toDF("code", "count", "year")

val df2 = spark.sparkContext.parallelize(Seq(
    ("AE", "Aerogon international"), ("BX", "Bloomberg Xtern"),
    ("CD", "Classic Divide"), ("HU", "Honololu")))
  .toDF("code", "exp_code")

val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()
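To see the aggregate-then-join data flow without needing a Spark session, here is a minimal sketch of the same logic using plain Scala collections. The sample values mirror code_count.csv and details.csv from the question; the object name JoinSketch is just an illustrative choice.

```scala
// Minimal sketch of the aggregate-then-join logic with plain Scala collections.
// counts plays the role of code_count.csv, details the role of details.csv.
object JoinSketch {
  val counts = Seq(("AE", 2), ("AE", 3), ("BX", 1), ("CD", 4), ("HU", 1), ("BX", 8))
  val details = Map(
    "AE" -> "Aerogon international", "BX" -> "Bloomberg Xtern",
    "CD" -> "Classic Divide", "HU" -> "Honololu")

  // Sum counts per code, then look up exp_code -- the same shape as
  // groupBy("code").agg(sum("count")) followed by join(df2, "code").
  val result: Map[String, Int] =
    counts.groupBy(_._1).map { case (code, rows) =>
      details(code) -> rows.map(_._2).sum
    }
}
```

Note that with the sample data BX sums to 9 (1 + 8), which matches the Array((AE,5), (BX,9)) output shown in the question.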