Spark DataFrame Group By with a New Indicator Column

Updated: 2024-10-27 12:40:25

I need to group by the "KEY" column and check whether the "TYPE_CODE" column contains both "PL" and "JL" for that key. If it does, I need to add an Indicator column with the value "Y"; otherwise the value should be "N".

Example:

// Input values
val values = List(
  List("66", "PL"), List("67", "JL"), List("67", "PL"),
  List("67", "PO"), List("68", "JL"), List("68", "PO")
).map(x => (x(0), x(1)))

import spark.implicits._

// Create a DataFrame
val cmc = values.toDF("KEY", "TYPE_CODE")
cmc.show(false)

------------------------
KEY |TYPE_CODE |
------------------------
66  |PL        |
67  |JL        |
67  |PL        |
67  |PO        |
68  |JL        |
68  |PO        |
------------------------

Expected output:

For each "KEY", if its "TYPE_CODE" values include both PL and JL, the indicator is Y; otherwise it is N.

-----------------------------------
KEY |TYPE_CODE | Indicator
-----------------------------------
66  |PL        | N
67  |JL        | Y
67  |PL        | Y
67  |PO        | Y
68  |JL        | N
68  |PO        | N
-----------------------------------

For example:
67 has both PL and JL, so "Y"
66 has only PL, so "N"
68 has JL but no PL, so "N"

Best answer

One option:

1) collect the TYPE_CODE values of each KEY into a list;

2) check whether the list contains both required strings;

3) then flatten the list back into rows with explode:

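import org.apache.spark.sql.functions.{array_contains, collect_list, explode, when}

(cmc.groupBy("KEY")
  // 1) collect all TYPE_CODE values of a key into one array
  .agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
  // 2) "Y" only if the array contains both "PL" and "JL"
  .withColumn("Indicator",
    when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y")
      .otherwise("N"))
  // 3) explode the array back into one row per TYPE_CODE
  .withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show

+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68|       JL|        N|
| 68|       PO|        N|
| 67|       JL|        Y|
| 67|       PL|        Y|
| 67|       PO|        Y|
| 66|       PL|        N|
+---+---------+---------+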
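
The collect-and-explode approach above could also be written with a window aggregate, which keeps the original rows and avoids building an array per key. This is only a sketch of that alternative, not part of the original answer; the object name IndicatorByWindow, the helper hasCode, and the local[*] session are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// Hypothetical wrapper object; in spark-shell only the body of main is needed.
object IndicatorByWindow {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .appName("indicator-by-window")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same input as in the question.
    val cmc = Seq(
      ("66", "PL"), ("67", "JL"), ("67", "PL"),
      ("67", "PO"), ("68", "JL"), ("68", "PO")
    ).toDF("KEY", "TYPE_CODE")

    // Every row sees all rows that share its KEY.
    val byKey = Window.partitionBy("KEY")

    // Hypothetical helper: 1 if any row of the key has the given code, else 0.
    def hasCode(code: String) =
      max(when(col("TYPE_CODE") === code, 1).otherwise(0)).over(byKey)

    // "Y" only when both codes occur somewhere within the key.
    val result = cmc.withColumn(
      "Indicator",
      when(hasCode("PL") === 1 && hasCode("JL") === 1, "Y").otherwise("N")
    )

    // Sorted only to make the displayed order stable.
    result.orderBy("KEY", "TYPE_CODE").show(false)

    spark.stop()
  }
}
```

Under these assumptions, every row of key 67 should get "Y" and the rows of keys 66 and 68 should get "N", as in the expected output of the question. Whether this is faster than collect_list plus explode depends on the data, so it is worth comparing both on a realistic sample.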

Published: 2023-07-23 04:39:00
Article link: https://www.elefans.com/category/jswz/34/1227564.html