I have a DataFrame df with columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type/subtype combinations. While df can contain many different combinations, there are particular combinations that only appear in certain data types. To check whether my object contains any of these combinations, I'm currently doing:

    typeA = (
        ((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | (df.subtype == 5) | (df.subtype == 6)))
        | ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | (df.subtype == 8)))
    )
    A = typeA.sum()

Here typeA is a long Series of False values that might contain some True values; if A > 0, then I know it contained a True. The problem with this scheme is that if the first row of df produces a True, it still has to check everything else. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.
Thanks for any suggestions.
Best answer
Use crosstab:

    import numpy as np
    import pandas as pd

    # Example data: 100 rows of random type/subtype values in the range 0..9
    df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])

    # Count how often each (type, subtype) pair occurs
    counts = pd.crosstab(df.type, df.subtype)

    # Add up the counts of the combinations of interest
    print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())

The result is the same as:

    a = (
        ((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | (df.subtype == 5) | (df.subtype == 6)))
        | ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | (df.subtype == 8)))
    )
    a.sum()

Note that crosstab only creates rows and columns for values that actually occur in df, so .loc raises a KeyError if one of the requested type or subtype values is absent.
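Neither version above stops early once a match is found, which is what the question asks about. As a rough sketch only (not part of the original answer), the same check could be written with `Series.isin` and `any()`, and scanned in chunks so that the function returns as soon as one chunk contains a match. The names `TYPE_A_COMBOS`, `combo_mask`, `contains_combo` and the `chunk_size` default are hypothetical, chosen for illustration; whether chunking is actually faster on 100k rows would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: type value -> subtype values that mark "type A" data.
TYPE_A_COMBOS = {0: [2, 3, 5, 6], 5: [3, 4, 7, 8]}


def combo_mask(df, combos):
    """Return a boolean Series that is True where a row matches a combination."""
    mask = pd.Series(False, index=df.index)
    for type_value, subtypes in combos.items():
        mask |= (df["type"] == type_value) & df["subtype"].isin(subtypes)
    return mask


def contains_combo(df, combos, chunk_size=10_000):
    """Scan df chunk by chunk and return True as soon as a chunk has a match."""
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        if combo_mask(chunk, combos).any():
            return True
    return False


# Example data shaped like the question's: about 100k rows, values in 0..9.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(100_000, 2)), columns=["type", "subtype"])
print(contains_combo(df, TYPE_A_COMBOS))
```

If only a yes/no answer is needed, `combo_mask(df, TYPE_A_COMBOS).any()` on the whole frame is the direct replacement for `typeA.sum() > 0`; the chunked variant only pays off when matches tend to appear early in the data.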