Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers.
So, given a dataset with a Speed column, the possible options are ["Super Fast", "Fast", "Medium", "Slow"]. I want to implement custom sorting that will fit the context.
If I use the default sorting, the categories are sorted alphabetically. Pandas lets you change the column's data type to categorical, and part of that definition specifies a custom sort order: pandas.pydata/pandas-docs/stable/reference/api/pandas.Categorical.html
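For reference, here is a minimal sketch of the pandas behaviour being described, using the four Speed values from the question. The names `pdf`, `speed_order`, and `sorted_speeds` are illustrative only, and the sample rows are made up for the demonstration.

```python
import pandas as pd

# Desired order, from "fastest" to "slowest" (values taken from the question).
speed_order = ["Super Fast", "Fast", "Medium", "Slow"]

# Small illustrative frame with the rows deliberately out of order.
pdf = pd.DataFrame({"Speed": ["Medium", "Slow", "Super Fast", "Fast"]})

# An ordered Categorical sorts by category position instead of alphabetically.
pdf["Speed"] = pd.Categorical(pdf["Speed"], categories=speed_order, ordered=True)

sorted_speeds = pdf.sort_values("Speed")["Speed"].tolist()
print(sorted_speeds)
```

Sorting the plain string column instead would put "Fast" first and "Super Fast" last, which is the alphabetical behaviour the question wants to avoid.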
Recommended answer

You can use orderBy and define the custom ordering with when, mapping each category to a numeric rank:
from pyspark.sql.functions import col, when

# Map each Speed value to its rank, then sort by that rank.
df.orderBy(
    when(col("Speed") == "Super Fast", 1)
    .when(col("Speed") == "Fast", 2)
    .when(col("Speed") == "Medium", 3)
    .when(col("Speed") == "Slow", 4)
)