How to create a Pandas DataFrame from a list of lists with different lengths?

My data is in the following format:

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

and I want a DataFrame with all the unique strings as columns and binary values indicating whether each string occurs in each entry:

    a  b  c  d  e  f
0   1  1  1  0  0  0
1   0  1  1  0  0  0
2   0  0  1  1  1  1

I have working code that uses a list comprehension, but it is slow for large data.

# vocab_list contains all the unique keys, which is obtained when reading in data from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list] for entry in data])
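
(A minimal sketch, not part of the original question: if vocab_list is not already available from the file, it can be derived from data itself.)

# hypothetical helper: collect every unique word across all entries,
# sorted for a stable column order
vocab_list = sorted({word for entry in data for word in entry})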

Is there a way to optimize this task? Thanks.

Edit (small sample of the actual data):

[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'agravated', 'aggressively', 'all', 'almost', 'one', 'already', 'also', 'although']]

For better performance use MultiLabelBinarizer:

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

from sklearn.preprocessing import MultiLabelBinarizer    
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1
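
If the real vocabulary is much larger than this toy example, the dense 0/1 matrix can get memory-hungry. A minimal sketch of a sparse variant, not part of the original answer (it assumes a reasonably recent scikit-learn and pandas, which provide sparse_output and pd.DataFrame.sparse.from_spmatrix):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# sparse_output=True makes fit_transform return a scipy.sparse matrix,
# which from_spmatrix wraps in a DataFrame with sparse columns
mlb = MultiLabelBinarizer(sparse_output=True)
df_sparse = pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(data), columns=mlb.classes_)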

Edit:

data = [['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]

from sklearn.preprocessing import MultiLabelBinarizer    
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print (df)
   a  abiding  about  absurd  again  age  aggravated  aggressively  all  \
0  1        0      1       1      1    0           0             0    0   
1  1        1      0       0      0    1           1             1    1   

   almost  ...  also  although  an  associates  writes  wrote  x  york  you  \
0       0  ...     0         0   1           1       1      1  1     1    1   
1       1  ...     1         1   0           0       0      0  0     0    0   

   your  
0     1  
1     0  

[2 rows x 22 columns]

A pure pandas solution is possible, but I guess it should be slower:

df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print (df)
   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1

df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print (df)
   a  abiding  about  absurd  age  again  aggravated  aggressively  an  all  \
0  1        0      1       1    0      1           0             0   1    0   
1  1        1      0       0    1      0           1             1   0    1   

   ...  writes  alone  wrote  already  x  also  york  although  you  your  
0  ...       1      0      1        0  1     0     1         0    1     1  
1  ...       0      1      0        1  0     1     0         1    0     0  

[2 rows x 22 columns]
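
Note that max(level=0, axis=1) relies on the level argument of DataFrame aggregations, which has been deprecated and removed in newer pandas releases. A sketch of an equivalent, not from the original answer, that collapses the duplicate column labels via a transpose and groupby instead:

import pandas as pd

# get_dummies produces duplicate column labels (one per original column/value pair);
# transposing, grouping by label, and transposing back collapses the duplicates
dummies = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='')
df = dummies.T.groupby(level=0).max().T.astype(int)

(The columns come out sorted alphabetically here rather than in first-seen order.)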

You can use join with str.get_dummies in a list comprehension:

df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()

[out]

   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1
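
One caveat, not from the original answer: the join trick assumes that no token contains the '|' separator. If that cannot be guaranteed, pass a separator that is certain to be absent from the data; the choice of '\x1f' below is a hypothetical example:

import pandas as pd

sep = '\x1f'  # ASCII unit separator, very unlikely to appear in word tokens
df = pd.Series([sep.join(x) for x in data]).str.get_dummies(sep=sep)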

Out of curiosity, I timed the proposed solutions:

from string import ascii_letters
from random import choice, randint
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = []
for _ in range(10000):
    data.append([choice(ascii_letters) for _ in range(randint(25, 65))])

print("Time using 'pd.Series' and 'str.get_dummies()':")
startTime = datetime.now()
df = pd.Series(['|'.join(x) for x in data]).str.get_dummies()
print(datetime.now() - startTime)

print("Time using 'pd.get_dummies()':")
startTime = datetime.now()
df2 = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(datetime.now() - startTime)

print("Time using 'MultiLabelBinarizer()':")
startTime = datetime.now()
mlb = MultiLabelBinarizer()
df3 = pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
print(datetime.now() - startTime)

Although the results differ from run to run because the lists have random lengths, the gap is more or less the same:

Time using 'pd.Series' and 'str.get_dummies()':
0:00:00.450311
Time using 'pd.get_dummies()':
0:00:00.498003
Time using 'MultiLabelBinarizer()':
0:00:00.083955

So indeed, by using sklearn you get faster results.
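
As a side note, not part of the original answer: the numbers above are single runs measured with datetime, so they are somewhat noisy. A minimal sketch using timeit on the same data and mlb objects defined above gives more stable figures:

import timeit

runs = 10
# time only the transform plus DataFrame construction, averaged over several runs
t = timeit.timeit(lambda: pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_), number=runs)
print(f"MultiLabelBinarizer, mean of {runs} runs: {t / runs:.4f} s")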
