I have data in the following format:

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

and I would like to have a DataFrame with all unique strings as columns and binary values of occurrence, as such:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

I have working code using a list comprehension, but it's pretty slow for large data.
# vocab_list contains all the unique keys, obtained when reading the data in from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list]
                   for entry in data])

Is there any way to optimise this task? Thanks.
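As an aside, one cheap speed-up of the comprehension itself, before reaching for a library, is to convert each entry to a `set` so the `word in entry` membership test is O(1) instead of a linear scan. A minimal sketch (the small `vocab_list` here is just for illustration):

```python
import pandas as pd

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]
vocab_list = ["a", "b", "c", "d", "e", "f"]

# Build a set per row once, so each membership test is O(1)
# instead of scanning the whole list for every vocabulary word
entry_sets = [set(entry) for entry in data]
df = pd.DataFrame([[1 if word in s else 0 for word in vocab_list]
                   for s in entry_sets], columns=vocab_list)
print(df)
```

This keeps the original approach but removes the repeated list scans; for a large vocabulary the difference is substantial.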
EDIT (a small sample of actual data):
[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
Answer

For better performance use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes',
         'wrote', 'x', 'york', 'you', 'your'],
        ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all',
         'almost', 'alone', 'already', 'also', 'although']]

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  abiding  about  absurd  again  age  aggravated  aggressively  all  \
0  1        0      1       1      1    0           0             0    0
1  1        1      0       0      0    1           1             1    1

   almost  ...  also  although  an  associates  writes  wrote  x  york  you  \
0       0  ...     0         0   1           1       1      1  1     1    1
1       1  ...     1         1   0           0       0      0  0     0    0

   your
0     1
1     0

[2 rows x 22 columns]
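For very large vocabularies, memory can matter as much as speed. `MultiLabelBinarizer` also accepts `sparse_output=True`, which returns a SciPy sparse matrix that pandas can wrap without densifying. A sketch (assumes scikit-learn and a pandas version with `DataFrame.sparse.from_spmatrix`):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# sparse_output=True yields a scipy CSR matrix instead of a dense array,
# which saves memory when the vocabulary is large and rows are short
mlb = MultiLabelBinarizer(sparse_output=True)
mat = mlb.fit_transform(data)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=mlb.classes_)
print(df)
```

The resulting columns use pandas' sparse dtype; call `df.sparse.to_dense()` if a downstream step needs ordinary integer columns.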
A pure pandas solution is also possible, but I would guess it is slower:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1

df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  abiding  about  absurd  age  again  aggravated  aggressively  an  all  \
0  1        0      1       1    0      1           0             0   1    0
1  1        1      0       0    1      0           1             1   0    1

   ...  writes  alone  wrote  already  x  also  york  although  you  your
0  ...       1      0      1        0  1     0     1         0    1     1
1  ...       0      1      0        1  0     1     0         1    0     0

[2 rows x 22 columns]
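On recent pandas versions, another pure-pandas route is `Series.explode` plus `pd.crosstab`: explode each list into one row per word (keeping the original row number as the index), then cross-tabulate index against word. A sketch:

```python
import pandas as pd

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# explode() turns each list element into its own row, keeping the
# original row number as the index; crosstab then counts occurrences
s = pd.Series(data).explode()
df = pd.crosstab(s.index, s)
print(df)
```

Note that `crosstab` produces counts, so if an entry can contain repeated words, cap the result with `.clip(upper=1)` to keep the values binary.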