How to create a Pandas DataFrame from a list of lists with different lengths?

Updated: 2024-10-06 10:35:08
Question

I have data in the following format:

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

and I would like a DataFrame with every unique string as a column and a binary value marking whether it occurs in each row, like this:

   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

I have working code that uses list comprehensions, but it is pretty slow for large data.

# vocab_list contains all the unique keys, obtained when reading the data in from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list] for entry in data])

Is there any way to optimise this task? Thanks.

EDIT (a small sample of actual data):

[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]

Recommended answer

For better performance, use MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted unique labels
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1

data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)

   a  abiding  about  absurd  again  age  aggravated  aggressively  all  \
0  1        0      1       1      1    0           0             0    0
1  1        1      0       0      0    1           1             1    1

   almost  ...  also  although  an  associates  writes  wrote  x  york  you  \
0       0  ...     0         0   1           1       1      1  1     1    1
1       1  ...     1         1   0           0       0      0  0     0    0

   your
0     1
1     0

[2 rows x 22 columns]
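As a sanity check, the sketch below compares the MultiLabelBinarizer result with the list-comprehension baseline from the question. It assumes vocab_list is the sorted set of unique labels (the question only says it is built while reading the file), and it also shows the sparse_output=True option together with pandas' DataFrame.sparse.from_spmatrix, which may help with memory when the vocabulary is large. Treat it as an illustration under those assumptions, not as a benchmark.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# Baseline from the question. Assumption: vocab_list is rebuilt here as the
# sorted unique labels, since the original file-reading code is not shown.
vocab_list = sorted({word for entry in data for word in entry})
baseline = pd.DataFrame(
    [[1 if word in entry else 0 for word in vocab_list] for entry in data],
    columns=vocab_list,
)

# Dense result, as in the answer.
mlb = MultiLabelBinarizer()
dense = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)

# Sparse variant: fit_transform returns a SciPy sparse matrix, which pandas
# can wrap in sparse columns without materialising all the zeros.
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
sparse = pd.DataFrame.sparse.from_spmatrix(
    mlb_sparse.fit_transform(data), columns=mlb_sparse.classes_
)

expected = baseline.to_numpy().tolist()
same_dense = dense.to_numpy().tolist() == expected
same_sparse = sparse.sparse.to_dense().to_numpy().tolist() == expected
print(same_dense, same_sparse)
```

The dense frame is the simplest to work with; the sparse one is worth considering only when most cells are zero and the dense matrix would not fit comfortably in memory.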

A pure pandas solution is possible, but I would expect it to be slower:

# Note: the level argument of DataFrame.max was removed in pandas 2.0,
# so this line runs as written only on older pandas versions.
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1

# The same call on the larger sample
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)

   a  abiding  about  absurd  age  again  aggravated  aggressively  an  all  \
0  1        0      1       1    0      1           0             0   1    0
1  1        1      0       0    1      0           1             1   0    1

   ...  writes  alone  wrote  already  x  also  york  although  you  your
0  ...       1      0      1        0  1     0     1         0    1     1
1  ...       0      1      0        1  0     1     0         1    0     0

[2 rows x 22 columns]
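On pandas 2.x the max(level=0, axis=1) form no longer runs, because the level argument was deprecated in pandas 1.3 and removed in 2.0. One possible equivalent, which is a swapped-in technique rather than the answer's own code, is to explode the lists into one row per label, one-hot encode them, and take the maximum per original row. This is a sketch; note that the columns come out sorted here, unlike the output shown above.

```python
import pandas as pd

data = [["a", "b", "c"], ["b", "c"], ["d", "e", "f", "c"]]

# One row per (original row, label); the index keeps the original row number.
labels = pd.Series(data).explode()

# One-hot encode every label, then collapse back to one row per original
# list. max() keeps the result binary even if a label repeats within a list.
# dtype=int is passed because recent pandas versions default to booleans.
df = pd.get_dummies(labels, dtype=int).groupby(level=0).max()
print(df)
```

As with the original pure pandas version, this is likely to be slower than MultiLabelBinarizer on large inputs, so it is mainly useful when adding a scikit-learn dependency is not an option.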

Published: 2023-11-29 15:35:49
Link: https://www.elefans.com/category/jswz/34/1646814.html