I have data in the following format:

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

and I would like to have a DataFrame with all unique strings as columns and binary values of occurrence, as such:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

I have working code using a list comprehension, but it's pretty slow for large data.
# vocab_list contains all the unique keys, obtained when reading the data in from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list]
                   for entry in data])

Is there any way to optimise this task? Thanks.
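As an aside, one cheap speed-up of the comprehension itself, before reaching for a library, is to convert each entry to a `set` so the `word in entry` membership test is O(1) instead of a linear scan. A minimal sketch (the small `vocab_list` here is just for illustration):

```python
import pandas as pd

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]
vocab_list = ["a", "b", "c", "d", "e", "f"]

# Build a set per row once, so each membership test is O(1)
# instead of scanning the whole list for every vocabulary word
entry_sets = [set(entry) for entry in data]
df = pd.DataFrame([[1 if word in s else 0 for word in vocab_list]
                   for s in entry_sets], columns=vocab_list)
print(df)
```

This keeps the original approach but removes the repeated list scans; for a large vocabulary the difference is substantial.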
EDIT (a small sample of actual data):
[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
Answer

For better performance use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes',
         'wrote', 'x', 'york', 'you', 'your'],
        ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all',
         'almost', 'alone', 'already', 'also', 'although']]

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  abiding  about  absurd  again  age  aggravated  aggressively  all  \
0  1        0      1       1      1    0           0             0    0
1  1        1      0       0      0    1           1             1    1

   almost  ...  also  although  an  associates  writes  wrote  x  york  you  \
0       0  ...     0         0   1           1       1      1  1     1    1
1       1  ...     1         1   0           0       0      0  0     0    0

   your
0     1
1     0

[2 rows x 22 columns]
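For very large vocabularies, memory can matter as much as speed. `MultiLabelBinarizer` also accepts `sparse_output=True`, which returns a SciPy sparse matrix that pandas can wrap without densifying. A sketch (assumes scikit-learn and a pandas version with `DataFrame.sparse.from_spmatrix`):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# sparse_output=True yields a scipy CSR matrix instead of a dense array,
# which saves memory when the vocabulary is large and rows are short
mlb = MultiLabelBinarizer(sparse_output=True)
mat = mlb.fit_transform(data)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=mlb.classes_)
print(df)
```

The resulting columns use pandas' sparse dtype; call `df.sparse.to_dense()` if a downstream step needs ordinary integer columns.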
A pure pandas solution is also possible, but I would guess it is slower:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1

df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  abiding  about  absurd  age  again  aggravated  aggressively  an  all  \
0  1        0      1       1    0      1           0             0   1    0
1  1        1      0       0    1      0           1             1   0    1

   ...  writes  alone  wrote  already  x  also  york  although  you  your
0  ...       1      0      1        0  1     0     1         0    1     1
1  ...       0      1      0        1  0     1     0         1    0     0

[2 rows x 22 columns]
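On recent pandas versions, another pure-pandas route is `Series.explode` plus `pd.crosstab`: explode each list into one row per word (keeping the original row number as the index), then cross-tabulate index against word. A sketch:

```python
import pandas as pd

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# explode() turns each list element into its own row, keeping the
# original row number as the index; crosstab then counts occurrences
s = pd.Series(data).explode()
df = pd.crosstab(s.index, s)
print(df)
```

Note that `crosstab` produces counts, so if an entry can contain repeated words, cap the result with `.clip(upper=1)` to keep the values binary.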