How to create a Pandas DataFrame from a list of lists with different lengths?
My data is in the following format:
data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]
and I want a DataFrame with every unique string as a column and binary values indicating its occurrence:
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
I have working code that uses a list comprehension, but it is slow for large data:
# vocab_list contains all the unique keys, which is obtained when reading in data from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list] for entry in data])
Is there a way to optimize this task? Thanks.
Edit (a small sample of the actual data):
[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'agravated', 'aggressively', 'all', 'almost', 'one', 'already', 'also', 'although']]
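A note on the baseline itself: the membership test `word in entry` is O(len(entry)) on a list, so converting each entry to a set first makes each lookup O(1). A minimal sketch of that tweak (here, as an assumption, `vocab_list` is derived from the data itself rather than read from a file):

```python
import pandas as pd

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# Assumption: build the vocabulary from the data; the question reads it from a file.
vocab_list = sorted({word for entry in data for word in entry})

# Convert each entry to a set once, so every `word in s` test is O(1).
sets = [set(entry) for entry in data]
df = pd.DataFrame([[1 if word in s else 0 for word in vocab_list] for s in sets],
                  columns=vocab_list)
```

This keeps the original approach but removes the repeated linear scans of each inner list.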
For better performance, use MultiLabelBinarizer:
data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
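For very large vocabularies the dense output can dominate memory. MultiLabelBinarizer can also emit a SciPy sparse matrix, which pandas can wrap without densifying; a sketch (assuming pandas ≥ 0.25 for `DataFrame.sparse.from_spmatrix`):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# sparse_output=True makes fit_transform return a scipy.sparse CSR matrix
# instead of a dense ndarray.
mlb = MultiLabelBinarizer(sparse_output=True)
mat = mlb.fit_transform(data)

# Wrap the sparse matrix in a DataFrame without converting it to dense.
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=mlb.classes_)
```

The resulting columns use pandas' sparse dtype; call `df.sparse.to_dense()` if a dense frame is needed downstream.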
EDIT:
data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
a abiding about absurd again age aggravated aggressively all \
0 1 0 1 1 1 0 0 0 0
1 1 1 0 0 0 1 1 1 1
almost ... also although an associates writes wrote x york you \
0 0 ... 0 0 1 1 1 1 1 1 1
1 1 ... 1 1 0 0 0 0 0 0 0
your
0 1
1 0
[2 rows x 22 columns]
A pure pandas solution is possible, but I guess it should be slower:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print (df)
a b d c e f
0 1 1 0 1 0 0
1 0 1 0 1 0 0
2 0 0 1 1 1 1
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print (df)
a abiding about absurd age again aggravated aggressively an all \
0 1 0 1 1 0 1 0 0 1 0
1 1 1 0 0 1 0 1 1 0 1
... writes alone wrote already x also york although you your
0 ... 1 0 1 0 1 0 1 0 1 1
1 ... 0 1 0 1 0 1 0 1 0 0
[2 rows x 22 columns]
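Note that `DataFrame.max(level=...)` was deprecated and later removed in pandas 2.x. On current pandas, the same collapse of duplicate column names can be written with a groupby over the transposed frame; a sketch of that equivalent:

```python
import pandas as pd

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

dummies = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='')
# Transpose so duplicate column names become index labels, group identical
# labels (level 0) and take the row-wise max, then transpose back.
df = dummies.T.groupby(level=0).max().T.astype(int)
```

The result is the same one-hot frame, with columns sorted by the groupby.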
You can use join in a list comprehension with str.get_dummies:
df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()
[out]
a b c d e f
0 1 1 1 0 0 0
1 0 1 1 0 0 0
2 0 0 1 1 1 1
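The same idea works without the Python-level list comprehension: `Series.str.join` concatenates each inner list element-wise, and `str.get_dummies` splits on `'|'` by default. A sketch (caveat: the separator must not occur inside the strings themselves):

```python
import pandas as pd

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# str.join applies to each list in the Series, producing "a|b|c", "b|c", ...
# str.get_dummies then one-hot encodes on the default '|' separator.
df = pd.Series(data).str.join('|').str.get_dummies()
```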
Out of curiosity, I timed the proposed solutions:
from string import ascii_letters
from random import choice, randint
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
data = []
for _ in range(10000):
    data.append([choice(ascii_letters) for _ in range(randint(25, 65))])
print("Time using 'pd.Series' and 'str.get_dummies()':")
startTime = datetime.now()
df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()
print(datetime.now() - startTime)
print("Time using 'pd.get_dummies()':")
startTime = datetime.now()
df2 = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(datetime.now() - startTime)
print("Time using 'MultiLabelBinarizer()':")
startTime = datetime.now()
mlb = MultiLabelBinarizer()
df3 = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print(datetime.now() - startTime)
Although the results vary from run to run because the lists have random lengths, the differences are more or less the same:
Time using 'pd.Series' and 'str.get_dummies()':
0:00:00.450311
Time using 'pd.get_dummies()':
0:00:00.498003
Time using 'MultiLabelBinarizer()':
0:00:00.083955
So indeed, you get faster results by using sklearn.