Pandas: how to factorize in an unusual text order

Updated: 2024-06-14 16:57:18

I have a DataFrame with a column 'cat100' whose values look like this:

'A' 'B' ... 'Y' 'Z' 'AA' 'AB' ...

I would like to factorize the column using pd.factorize such that 'AA' comes after 'B', 'C', ..., 'Z'.

I've tried something like:

    import pandas as pd

    df = pd.DataFrame(['A', 'B', 'AA'])
    df[0] = pd.factorize(df[0], sort=True)[0]

But this assigns A to 0, B to 2, and AA to 1. I want AA to be assigned to 2 and B to 1.

I've searched for ways to do this and haven't found anything. Is there a way to do this?
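
As a small illustration of the behavior described in the question: with sort=True, pd.factorize orders the unique values lexicographically, so 'AA' sorts before 'B'. The sketch below reproduces that and also shows the ordering the question asks for, using a (length, string) sort key. The names labels and wanted are only illustrative and do not appear in the original post.

```python
import pandas as pd

labels = ["A", "B", "AA"]

# sort=True orders the unique values lexicographically: 'A' < 'AA' < 'B',
# so 'B' receives code 2 and 'AA' receives code 1.
codes, uniques = pd.factorize(pd.Series(labels), sort=True)
print(list(codes))    # [0, 2, 1]
print(list(uniques))  # ['A', 'AA', 'B']

# The ordering the question wants: shorter strings first,
# ties broken alphabetically.
wanted = sorted(labels, key=lambda s: (len(s), s))
print(wanted)         # ['A', 'B', 'AA']
```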

Best answer

Consider a DataFrame with a string column, as shown:

    df = pd.DataFrame(dict(col=['A', 'B', 'AA', 'C', 'BB', 'AAA', 'BC', 'AB', 'AA']))
    df

Custom function:

(i) Take the unique entries of the column under consideration.
(ii) Group them by string length, sort each group lexicographically, and stack the groups horizontally.
(iii) Factorize the result.

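    import numpy as np
    import pandas as pd

    def complex_factorize(df, col):
        # Unique entries of the column
        ser = pd.Series(df[col].unique())
        # Sort the members of each length group lexicographically
        func = lambda x: sorted(x.values.ravel())
        # Concatenate the groups in order of increasing string length
        arr = np.hstack(ser.groupby(ser.str.len()).apply(func).values)
        return pd.factorize(arr)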

Take the labels and the unique elements returned by the factorize method and pass them to DataFrame.replace to apply the mapping.

    val, ser = complex_factorize(df, 'col')
    df.replace(ser, val)
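
For comparison, here is a minimal sketch of the same idea without groupby, assuming the desired order is "shorter strings first, then alphabetical". It is an alternative rather than the answer's own method: it sorts the unique labels with a (length, string) key and applies an explicit mapping with Series.map. The names ordered, codes, and the 'code' and 'code_cat' columns are hypothetical and chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"col": ["A", "B", "AA", "C", "BB", "AAA", "BC", "AB", "AA"]})

# Order the unique labels by length first, then alphabetically,
# so every one-letter label precedes 'AA'.
ordered = sorted(df["col"].unique(), key=lambda s: (len(s), s))

# Build an explicit label -> integer mapping and apply it.
codes = {label: i for i, label in enumerate(ordered)}
df["code"] = df["col"].map(codes)

# The same codes can be obtained from an ordered Categorical
# whose categories are given in the custom order.
df["code_cat"] = pd.Categorical(df["col"], categories=ordered, ordered=True).codes

print(ordered)              # ['A', 'B', 'C', 'AA', 'AB', 'BB', 'BC', 'AAA']
print(df["code"].tolist())  # [0, 1, 3, 2, 5, 7, 6, 4, 3]
```

Series.map leaves the original column untouched and writes the codes to a new one, which makes the mapping easy to inspect. The Categorical variant may be preferable if the column should keep its labels while still sorting in the custom order.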

Published: 2023-04-12 20:56:00

Article link: https://www.elefans.com/category/dzcp/61489c73517b41dd148cc720ee4b723d.html