I have a dataframe that has a column 'cat100' that has values like the following:
'A' 'B' ... 'Y' 'Z' 'AA' 'AB' ...
I would like to factorize the column using pd.factorize such that AA is after 'B' 'C' ... 'Z'.
I've tried something like:
    df = pd.DataFrame(['A','B','AA'])
    df[0] = pd.factorize(df[0], sort=True)[0]
But this assigns A to 0, B to 2, and AA to 1. I want AA to be assigned to 2 and B to 1.
I've searched for ways to do this and haven't found anything. Is there a way to do this?
Best answer
Consider a DataFrame with a string column as shown:
    import numpy as np
    import pandas as pd
    df = pd.DataFrame(dict(col=['A','B','AA','C','BB','AAA','BC','AB','AA']))
    df
Custom function:
(i) Take the unique entries from the column under consideration.
(ii) Group them by string length, sort each group lexicographically, and stack the groups horizontally.
(iii) Factorize the result.
    def complex_factorize(df, col):
        # unique entries of the column
        ser = pd.Series(df[col].unique())
        # sort the strings inside each length group
        func = lambda x: sorted(x.values.ravel())
        # groupby orders its keys, so shorter strings come first
        arr = np.hstack(ser.groupby(ser.str.len()).apply(func).values)
        return pd.factorize(arr)
Take the labels and the unique elements returned by the factorize method and feed them to DataFrame.replace to construct the mapping:
    val, ser = complex_factorize(df, 'col')
    df.replace(ser, val)
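The same length-then-alphabetical ordering can also be sketched without groupby, by sorting the unique strings on a (length, string) key and mapping each one to its position. This is only an illustrative variant, not part of the answer above; the helper name shortlex_codes and the output column name 'code' are made up for this example, and missing values are assumed not to occur.

```python
import pandas as pd


def shortlex_codes(values):
    """Return {label: code}, ordering labels by length first, then alphabetically.

    Hypothetical helper for illustration; assumes every value is a string.
    """
    ordered = sorted(set(values), key=lambda s: (len(s), s))
    return {label: code for code, label in enumerate(ordered)}


df = pd.DataFrame({'col': ['A', 'B', 'AA', 'C', 'BB', 'AAA', 'BC', 'AB', 'AA']})

# Build the mapping once, then translate the column with Series.map.
mapping = shortlex_codes(df['col'])
df['code'] = df['col'].map(mapping)
print(df)
```

Using Series.map with an explicit dictionary keeps the label-to-code mapping available for later inspection or for decoding the integers back to labels.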