I am using spark-sql 2.4.1. How can I join with different tables depending on the value of a column?
Sample data

    val data = List(
      ("20", "score", "school", 14, 12),
      ("21", "score", "school", 13, 13),
      ("22", "rate", "school", 11, 14),
      ("21", "rate", "school", 13, 12)
    )
    val df = data.toDF("id", "code", "entity", "value1", "value2")

    +---+-----+------+------+------+
    | id| code|entity|value1|value2|
    +---+-----+------+------+------+
    | 20|score|school|    14|    12|
    | 21|score|school|    13|    13|
    | 22| rate|school|    11|    14|
    | 21| rate|school|    13|    12|
    +---+-----+------+------+------+

Based on the "code" column value, I need to join with various other tables.
    // rateDs
    val data1 = List(
      ("22", 11, "A"),
      ("22", 14, "B"),
      ("20", 13, "C"),
      ("21", 12, "C"),
      ("21", 13, "D")
    )
    val rateDs = data1.toDF("id", "map_code", "map_val")
    val scoreDs = // scoreTable

If the "code" column value is "rate", I need to join with rateDs; if the "code" column value is "score", I need to join with scoreDs.
How should this kind of thing be handled in Spark? Is there an optimal way to achieve it?
Expected result for the "rate" rows

    +---+-----+------+------+------+
    | id| code|entity|value1|value2|
    +---+-----+------+------+------+
    | 22| rate|school|     A|     B|
    | 21| rate|school|     D|     C|
    +---+-----+------+------+------+

Recommended answer
You can simply join twice, for example
    import org.apache.spark.sql.functions.{col, lit}
    import spark.implicits._  // needed for toDF; already in scope in spark-shell

    val data = List(
      ("20", "score", "school", 14, 12),
      ("21", "score", "school", 13, 13),
      ("22", "rate", "school", 11, 14),
      ("21", "rate", "school", 13, 12)
    )
    val df = data.toDF("id", "code", "entity", "value1", "value2")

    val data1 = List(
      ("22", 11, "A"),
      ("22", 14, "B"),
      ("20", 13, "C"),
      ("21", 12, "C"),
      ("21", 13, "D")
    )
    val rateDF = data1.toDF("id", "map_code", "map_val")

    df.as("a")
      // first join: look up the mapped value for value1
      .join(rateDF.as("b"),
        col("a.code") === lit("rate") && col("a.id") === col("b.id") && col("a.value1") === col("b.map_code"),
        "inner")
      // second join: look up the mapped value for value2
      .join(rateDF.as("c"),
        col("a.code") === lit("rate") && col("a.id") === col("c.id") && col("a.value2") === col("c.map_code"),
        "inner")
      .select(col("a.id"), col("a.code"), col("a.entity"),
        col("b.map_val").as("value1"), col("c.map_val").as("value2"))
      .show(false)

    +---+----+------+------+------+
    |id |code|entity|value1|value2|
    +---+----+------+------+------+
    |22 |rate|school|A     |B     |
    |21 |rate|school|D     |C     |
    +---+----+------+------+------+

Well, this looks a bit messy, but I don't have a cleaner idea for handling multiple columns...
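One possible way to tidy up the repeated joins is to fold over the value columns and run the same lookup join once per column. The following is only a sketch under assumptions: the names `JoinByCode`, `mapValues` and `lookups` are hypothetical, and it assumes every lookup table (including scoreDs, whose layout the question does not show) has the same id / map_code / map_val columns as rateDs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinByCode {

  // Hypothetical helper: keep only the rows of `df` whose "code" equals `code`,
  // then replace each column in `valueCols` with the matching map_val from `lookup`.
  // Assumes `lookup` has the columns id, map_code, map_val.
  def mapValues(df: DataFrame, lookup: DataFrame, code: String, valueCols: Seq[String]): DataFrame = {
    val mapped = valueCols.foldLeft(df.filter(col("code") === code)) { (acc, c) =>
      // Rename the lookup columns per iteration so every column name stays unique.
      val l = lookup.select(
        col("id").as(s"${c}_id"),
        col("map_code").as(s"${c}_code"),
        col("map_val").as(s"${c}_val"))
      acc.join(l, col("id") === col(s"${c}_id") && col(c) === col(s"${c}_code"), "inner")
        .drop(c, s"${c}_id", s"${c}_code")   // drop the raw value and the join keys
        .withColumnRenamed(s"${c}_val", c)   // the mapped value takes the original name
    }
    // Restore the original column order, which drop/rename has shuffled.
    mapped.select(df.columns.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-by-code").getOrCreate()
    import spark.implicits._

    val df = List(
      ("20", "score", "school", 14, 12),
      ("21", "score", "school", 13, 13),
      ("22", "rate", "school", 11, 14),
      ("21", "rate", "school", 13, 12)
    ).toDF("id", "code", "entity", "value1", "value2")

    val rateDs = List(
      ("22", 11, "A"), ("22", 14, "B"), ("20", 13, "C"), ("21", 12, "C"), ("21", 13, "D")
    ).toDF("id", "map_code", "map_val")

    // One entry per code value. A "score" -> scoreDs entry could be added here,
    // assuming scoreDs shares the same three-column layout.
    val lookups: Map[String, DataFrame] = Map("rate" -> rateDs)
    val valueCols = Seq("value1", "value2")

    val result = lookups
      .map { case (code, lookup) => mapValues(df, lookup, code, valueCols) }
      .reduce(_ union _)

    // With the sample data this yields the two rate rows
    // (22 -> A, B and 21 -> D, C); row order is not guaranteed.
    result.show(false)

    spark.stop()
  }
}
```

Filtering on the code before joining keeps each lookup join limited to the rows it applies to, and the per-code results have identical columns, so they can be combined with union. Whether this is faster than the two explicit joins would need to be measured on real data.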