PySpark: "explode" a dict stored in a column

Updated: 2024-10-19 08:49:28
Problem description

I have a column 'true_recoms' in a Spark DataFrame:

    -RECORD 17-----------------------------------------------------------------
     item        | 20380109
     true_recoms | {"5556867":1,"5801144":5,"7397596":21}

I need to 'explode' this column to get something like this:

    item       | 20380109
    recom_item | 5556867
    recom_cnt  | 1
    ..............
    item       | 20380109
    recom_item | 5801144
    recom_cnt  | 5
    ..............
    item       | 20380109
    recom_item | 7397596
    recom_cnt  | 21

I've tried to use from_json, but it doesn't work:

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    schema_json = StructType(fields=[
        StructField("item", StringType()),
        StructField("recoms", StringType())
    ])

    df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)

    +--------+--------------------+------+
    |    item|         true_recoms|true_r|
    +--------+--------------------+------+
    |31746548|{"32731749":3,"31...|   [,]|
    |17359322|{"17359392":1,"17...|   [,]|
    |31480894|{"31480598":1,"31...|   [,]|
    | 7265665|{"7265891":1,"503...|   [,]|
    |31350949|{"32218698":1,"31...|   [,]|
    +--------+--------------------+------+
    only showing top 5 rows

Recommended answer

The schema is incorrectly defined. You declare it as a struct with two string fields

  • item
  • recoms

while neither field is present in the document.
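
The mismatch can be checked without Spark. The following is a minimal plain-Python sketch, not a description of how from_json is implemented: it looks up the declared field names in the parsed document and finds nothing, which is consistent with the empty [,] values in the question's output. The variable names are chosen for this illustration only.

```python
import json

# Sample value copied from the question's true_recoms column.
doc = json.loads('{"5556867":1,"5801144":5,"7397596":21}')

# Field names declared in the question's struct schema.
declared_fields = ["item", "recoms"]

# Neither declared name is a key of the document, so every lookup yields None.
struct_view = {name: doc.get(name) for name in declared_fields}
print(struct_view)  # {'item': None, 'recoms': None}

# The keys are data (recommended item ids), not fixed field names,
# which is the shape a map type is meant to describe.
print(sorted(doc))  # ['5556867', '5801144', '7397596']
```

Because the keys differ from row to row, no fixed list of struct fields can describe this column.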

Unfortunately, in the Spark version this answer targets, from_json can return only structs or arrays of structs (more recent Spark releases also accept a map schema), so redefining it as

MapType(StringType(), LongType())

is not an option.

Personally, I would use a udf

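    from pyspark.sql.functions import udf, explode
    import json

    @udf("map<string, bigint>")
    def parse(s):
        # Parse the JSON string into a dict, which Spark returns as a map column.
        try:
            return json.loads(s)
        except (TypeError, json.JSONDecodeError):
            # Null input (TypeError) or malformed JSON: return null for this row.
            return None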

which can be applied like this

    df = spark.createDataFrame(
        [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
        ("item", "true_recoms")
    )

    df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
    # +--------+----------+---------+
    # |    item|recom_item|recom_cnt|
    # +--------+----------+---------+
    # |31746548|   5801144|        5|
    # |31746548|   7397596|       21|
    # |31746548|   5556867|        1|
    # +--------+----------+---------+
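
To check the logic outside Spark, here is a plain-Python sketch of what the parse-then-explode step computes. explode_map is a hypothetical helper written for this illustration, not a PySpark function, and the second input row is made-up data added to show the failure path. It relies on two properties of explode: a map column yields one output row per key/value pair, and a null map yields no rows.

```python
import json

def parse(s):
    """Plain-Python version of the udf body; None input is also treated as unparseable."""
    try:
        return json.loads(s)
    except (TypeError, json.JSONDecodeError):
        return None

def explode_map(rows):
    """Hypothetical stand-in for explode() applied to a map column."""
    out = []
    for item, raw in rows:
        parsed = parse(raw)
        if parsed is None:  # null map: this input row contributes no output rows
            continue
        for recom_item, recom_cnt in parsed.items():
            out.append((item, recom_item, recom_cnt))
    return out

rows = [
    (31746548, '{"5556867":1,"5801144":5,"7397596":21}'),
    (17359322, 'not json'),  # made-up malformed value
]
result = explode_map(rows)
for row in result:
    print(row)
# (31746548, '5556867', 1)
# (31746548, '5801144', 5)
# (31746548, '7397596', 21)
```

The sketch returns rows in the order the keys appear in the JSON string, while Spark does not guarantee the order of map entries, as the shuffled rows in the output above show.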


Published: 2023-11-22 05:47:56
Article link: https://www.elefans.com/category/jswz/34/1616198.html