This article describes how to join files in Apache Spark; it may serve as a useful reference for anyone facing the same problem.
Problem description
I have a file like this. code_count.csv
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
Another file like this. details.csv
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code but in the final output, I want the exp_code. Like this
Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4

Here is my code:
var countData = sc.textFile("C:\\path\\to\\code_count.csv")
var countDataKV = countData.map(x => x.split(",")).map(x => (x(0), x(1).toInt))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)

This gives:
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.
Recommended answer
You need to calculate the sum after grouping by code and then join with the other dataframe. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df1 = spark.sparkContext.parallelize(Seq(
    ("AE", 2, 2008), ("AE", 3, 2008), ("BX", 1, 2005),
    ("CD", 4, 2004), ("HU", 1, 2003), ("BX", 8, 2004)))
  .toDF("code", "count", "year")

val df2 = spark.sparkContext.parallelize(Seq(
    ("AE", "Aerogon international"), ("BX", "Bloomberg Xtern"),
    ("CD", "Classic Divide"), ("HU", "Honololu")))
  .toDF("code", "exp_code")

val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()
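To see the aggregate-then-join data flow without needing a Spark session, here is a minimal sketch of the same logic using plain Scala collections. The sample values mirror code_count.csv and details.csv from the question; the object name JoinSketch is just an illustrative choice.

```scala
// Minimal sketch of the aggregate-then-join logic with plain Scala collections.
// counts plays the role of code_count.csv, details the role of details.csv.
object JoinSketch {
  val counts = Seq(("AE", 2), ("AE", 3), ("BX", 1), ("CD", 4), ("HU", 1), ("BX", 8))
  val details = Map(
    "AE" -> "Aerogon international", "BX" -> "Bloomberg Xtern",
    "CD" -> "Classic Divide", "HU" -> "Honololu")

  // Sum counts per code, then look up exp_code -- the same shape as
  // groupBy("code").agg(sum("count")) followed by join(df2, "code").
  val result: Map[String, Int] =
    counts.groupBy(_._1).map { case (code, rows) =>
      details(code) -> rows.map(_._2).sum
    }
}
```

Note that with the sample data BX sums to 9 (1 + 8), which matches the Array((AE,5), (BX,9)) output shown in the question.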