Unzip a list of tuples in a PySpark DataFrame

Updated: 2024-10-28 00:23:23


Problem description


I want to unzip a list of tuples in a column of a PySpark DataFrame.

Say a column holds [(blue, 0.5), (red, 0.1), (green, 0.7)]. I want to split it into two columns, the first holding [blue, red, green] and the second holding [0.5, 0.1, 0.7].

+-----+-------------------------------------------+
|Topic|  Tokens                                   |
+-----+-------------------------------------------+
|    1|  ('blue', 0.5),('red', 0.1),('green', 0.7)|
|    2|  ('red', 0.9),('cyan', 0.5),('white', 0.4)|
+-----+-------------------------------------------+

It can be created with this code:

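# sqlCtx is an existing SQLContext; a SparkSession exposes the same method.
# Each row needs exactly two values to match the two column names, so the
# tuples are wrapped in a list that becomes the single Tokens column.
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)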

And the output should look like:

+-----+--------------------------+-----------------+
|Topic|  Tokens                  | Weights         |
+-----+--------------------------+-----------------+
|    1|  ['blue', 'red', 'green']| [0.5, 0.1, 0.7] |
|    2|  ['red', 'cyan', 'white']| [0.9, 0.5, 0.4] |
+-----+--------------------------+-----------------+
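For a single row, the transformation being asked for is the usual plain-Python "unzip". The sketch below only illustrates the intended result outside Spark; the helper name `unzip_pairs` is made up for this example and is not part of PySpark.

```python
def unzip_pairs(pairs):
    """Split a list of (token, weight) tuples into two parallel lists."""
    if not pairs:
        # zip(*[]) yields nothing to unpack, so handle the empty case explicitly.
        return [], []
    tokens, weights = zip(*pairs)
    return list(tokens), list(weights)


row = [("blue", 0.5), ("red", 0.1), ("green", 0.7)]
tokens, weights = unzip_pairs(row)
print(tokens)   # ['blue', 'red', 'green']
print(weights)  # [0.5, 0.1, 0.7]
```

Running this per row through a Python UDF would work, but it would move every row through the Python interpreter, so a column-level expression is usually preferable when the schema allows it.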

Recommended answer

If the schema of your DataFrame looks like this:

 root
  |-- Topic: long (nullable = true)
  |-- Tokens: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- _1: string (nullable = true)
  |    |    |-- _2: double (nullable = true)

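then you can select the struct fields directly; selecting a field of an array of structs returns an array of that field's values: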

from pyspark.sql.functions import col

df.select(
    col("Topic"),
    col("Tokens._1").alias("Tokens"), col("Tokens._2").alias("weights")
).show()
# +-----+------------------+---------------+       
# |Topic|            Tokens|        weights|
# +-----+------------------+---------------+
# |    1|[blue, red, green]|[0.5, 0.1, 0.7]|
# |    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
# +-----+------------------+---------------+

and, to generalize to any number of struct fields:

cols = [
    col("Tokens.{}".format(n)) for n in 
    df.schema["Tokens"].dataType.elementType.names]

df.select("Topic", *cols)

See also: Querying Spark SQL DataFrame with complex types


Published: 2023-04-18 21:49:19

Article link: https://www.elefans.com/category/jswz/34/947855.html
