Splitting one row of a PySpark DataFrame into multiple rows

This article describes how to split one row of a PySpark DataFrame into multiple rows; it may be a useful reference for anyone facing the same problem.

Problem description

I currently have a DataFrame where one column contains strings of the form "a b c d e ...". Call this column "col4".

I would like to split a single row into multiple rows by splitting the elements of col4, preserving the values of all the other columns.

So, for example, given a df with a single row:

col1[0] | col2[0] | col3[0] | a b c |

I would like the output to be:

col1[0] | col2[0] | col3[0] | a |

col1[0] | col2[0] | col3[0] | b |

col1[0] | col2[0] | col3[0] | c |

Using the split and explode functions, I have tried the following:

d = COMBINED_DF.select(col1, col2, col3, explode(split(my_fun(col4), " ")))

However, this results in the following output:

col1[0] | col2[0] | col3[0] | a b c |

col1[0] | col2[0] | col3[0] | a b c |

col1[0] | col2[0] | col3[0] | a b c |

which is not what I want.
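For reference, the select-based form in the question can also produce one row per element when the column is referenced directly and the exploded expression is given an alias. A minimal sketch, under the assumption that col4 already holds the plain space-separated string so my_fun can be dropped (an assumption, since my_fun is not shown):

from pyspark.sql.functions import col, split, explode

# Hypothetical rewrite of the select above; assumes col4 is a plain
# space-separated string, so my_fun is omitted.
d = COMBINED_DF.select('col1', 'col2', 'col3',
                       explode(split(col('col4'), ' ')).alias('col4'))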

Solution

Here's a reproducible example:

# Create dummy data
df = sc.parallelize([(1, 2, 3, 'a b c'),
                     (4, 5, 6, 'd e f'),
                     (7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3', 'col4'])

# Explode column
from pyspark.sql.functions import split, explode
df.withColumn('col4', explode(split('col4', ' '))).show()

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   2|   3|   a|
|   1|   2|   3|   b|
|   1|   2|   3|   c|
|   4|   5|   6|   d|
|   4|   5|   6|   e|
|   4|   5|   6|   f|
|   7|   8|   9|   g|
|   7|   8|   9|   h|
|   7|   8|   9|   i|
+----+----+----+----+
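If the position of each element within the original string also needs to be kept, posexplode from pyspark.sql.functions can be used in place of explode. A minimal sketch building on the same df; the output column names pos and col4 are just illustrative:

from pyspark.sql.functions import split, posexplode

# posexplode returns the element index alongside each value
df.select('col1', 'col2', 'col3',
          posexplode(split('col4', ' ')).alias('pos', 'col4')).show()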

