Split lists into multiple columns in a pandas DataFrame

I have a source system that gives me data like this:

Name    |Hobbies
----------------------------------
"Han"   |"Art;Soccer;Writing"
"Leia"  |"Art;Baking;Golf;Singing"
"Luke"  |"Baking;Writing"

Each hobby list is semicolon-delimited. I want to turn this into a table-like structure with one column per hobby and a flag indicating whether a person selected that hobby:

Name    |Art     |Baking  |Golf    |Singing |Soccer  |Writing  
--------------------------------------------------------------
"Han"   |1       |0       |0       |0       |1       |1
"Leia"  |1       |1       |1       |1       |0       |0
"Luke"  |0       |1       |0       |0       |0       |1

Here's code to generate the sample data in a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'name': 'Han',   'hobbies': 'Art;Soccer;Writing'},
...         {'name': 'Leia',  'hobbies': 'Art;Baking;Golf;Singing'},
...         {'name': 'Luke',  'hobbies': 'Baking;Writing'},
...     ]
... )
>>> df
                   hobbies  name
0       Art;Soccer;Writing   Han
1  Art;Baking;Golf;Singing  Leia
2           Baking;Writing  Luke

Right now, I'm using the following code to get the data into a DataFrame with the structure I want, but it is really slow (my actual data set has about 1.5 million rows):

>>> df2 = pd.DataFrame(columns=['name', 'hobby'])
>>>
>>> for index, row in df.iterrows():
...     for value in str(row['hobbies']).split(';'):
...         d = {'name':row['name'], 'value':value}
...         df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value  Art  Baking  Golf  Singing  Soccer  Writing
name
Han    1.0     0.0   0.0      0.0     1.0      1.0
Leia   1.0     1.0   1.0      1.0     0.0      0.0
Luke   0.0     1.0   0.0      0.0     0.0      1.0

Is there a more efficient way to do this?

What you could do is, instead of appending rows to the DataFrame on every iteration, collect them in a list and add them all at once after the loop:

# Collect one {'name', 'value'} record per (person, hobby) pair in a plain list.
d_list = []

for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name': row['name'],
                       'value': value})

# Build the DataFrame once from the list (DataFrame.append was removed in pandas 2.0).
df3 = pd.DataFrame(d_list, columns=['name', 'value'])
df3 = df3.groupby('name')['value'].value_counts()
df3 = df3.unstack(level=-1).fillna(0)
df3

I checked how much time it takes for your example DataFrame. With the improvement I suggest, it's ~50 times faster.
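To check the speed-up on your own data, a small timing harness along these lines may help. It is only a sketch: the function names `rows_one_at_a_time` and `rows_all_at_once` are made up for illustration, and `pd.concat` stands in for the per-row `DataFrame.append`, which no longer exists in pandas 2.0 and later. The ratio you see will depend on the data size and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [
        {"name": "Han", "hobbies": "Art;Soccer;Writing"},
        {"name": "Leia", "hobbies": "Art;Baking;Golf;Singing"},
        {"name": "Luke", "hobbies": "Baking;Writing"},
    ]
)


def rows_one_at_a_time(df):
    """Grow the frame inside the loop (hypothetical stand-in for per-row append)."""
    out = None
    for _, row in df.iterrows():
        for value in str(row["hobbies"]).split(";"):
            piece = pd.DataFrame([{"name": row["name"], "value": value}])
            # Every concat copies all rows collected so far.
            out = piece if out is None else pd.concat([out, piece], ignore_index=True)
    return out


def rows_all_at_once(df):
    """Collect plain dicts first and build the frame a single time."""
    records = [
        {"name": row["name"], "value": value}
        for _, row in df.iterrows()
        for value in str(row["hobbies"]).split(";")
    ]
    return pd.DataFrame(records, columns=["name", "value"])


slow = timeit.timeit(lambda: rows_one_at_a_time(df), number=20)
fast = timeit.timeit(lambda: rows_all_at_once(df), number=20)
print(f"per-row: {slow:.4f}s  batched: {fast:.4f}s")
```

Both functions return the same long-format rows, so the `groupby` / `unstack` step that follows is unaffected by which one is used.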

Why not just change the DataFrame in place?

for idx, row in df.iterrows():
    for hobby in row.hobbies.split(";"):
        df.loc[idx, hobby] = True

df.fillna(False, inplace=True)

Actually, using .str.split and .melt should be slightly faster than looping with iterrows.

  1. Splitting to multiple columns:

     >>> df = pd.DataFrame([{'name': 'Han', 'hobbies': 'Art;Soccer;Writing'},
     ...                    {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
     ...                    {'name': 'Luke', 'hobbies': 'Baking;Writing'}])
     >>> hobbies = df['hobbies'].str.split(';', expand=True)
     >>> hobbies
             0        1        2        3
     0     Art   Soccer  Writing     None
     1     Art   Baking     Golf  Singing
     2  Baking  Writing     None     None

  2. Unpivoting hobbies by names:

     >>> df = df.drop('hobbies', axis=1)
     >>> df = df.join(hobbies)
     >>> stacked = df.melt('name', value_name='hobby').drop('variable', axis=1)
     >>> stacked
         name    hobby
     0    Han      Art
     1   Leia      Art
     2   Luke   Baking
     3    Han   Soccer
     4   Leia   Baking
     5   Luke  Writing
     6    Han  Writing
     7   Leia     Golf
     8   Luke     None
     9    Han     None
     10  Leia  Singing
     11  Luke     None

  3. Counting the values:

     >>> counts = stacked.groupby('name')['hobby'].value_counts()
     >>> result = counts.unstack(level=-1).fillna(0).astype(int)
     >>> result
     hobby  Art  Baking  Golf  Singing  Soccer  Writing
     name
     Han      1       0     0        0       1        1
     Leia     1       1     1        1       0        0
     Luke     0       1     0        0       0        1
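
For convenience, the three steps can be gathered into one function. This is a minimal sketch: the name `hobby_flags` and its `sep` parameter are made up here, and the column names `name` and `hobbies` are assumed to match the sample data.

```python
import pandas as pd


def hobby_flags(df, sep=";"):
    """Return one 0/1 column per hobby, indexed by name (hypothetical helper)."""
    # Step 1: split the delimited strings into separate columns (padded with None).
    hobbies = df["hobbies"].str.split(sep, expand=True)
    # Step 2: unpivot so that each row holds a single (name, hobby) pair.
    wide = df.drop("hobbies", axis=1).join(hobbies)
    stacked = wide.melt("name", value_name="hobby").drop("variable", axis=1)
    # Step 3: count the pairs (missing values are skipped) and pivot back to wide form.
    counts = stacked.groupby("name")["hobby"].value_counts()
    return counts.unstack(level=-1).fillna(0).astype(int)


df = pd.DataFrame(
    [
        {"name": "Han", "hobbies": "Art;Soccer;Writing"},
        {"name": "Leia", "hobbies": "Art;Baking;Golf;Singing"},
        {"name": "Luke", "hobbies": "Baking;Writing"},
    ]
)
flags = hobby_flags(df)
print(flags)
```

The function leaves the input DataFrame untouched, so it can be called repeatedly, for example on chunks of a larger file.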

There are alternatives to steps 2 and 3, such as get_dummies or crosstab, as discussed in "Pandas get_dummies on multiple columns", but the first one will eat your memory, and the second one is much slower.
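
For reference, the two alternatives might look roughly like the sketch below. Note the assumptions: variant A uses `Series.str.get_dummies`, a string-accessor relative of the `pd.get_dummies` approach in the linked question, and variant B relies on `DataFrame.explode` (pandas 0.25 or later) to build the long format before calling `pd.crosstab`. Neither call appears in the answer above, and their memory use and speed were not measured here.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"name": "Han", "hobbies": "Art;Soccer;Writing"},
        {"name": "Leia", "hobbies": "Art;Baking;Golf;Singing"},
        {"name": "Luke", "hobbies": "Baking;Writing"},
    ]
)

# Variant A: split on the separator and build one 0/1 indicator column per hobby.
dummies = df["hobbies"].str.get_dummies(sep=";")
flags_a = df[["name"]].join(dummies)

# Variant B: one (name, hobby) pair per row, then cross-tabulate the pairs.
long_form = df.assign(hobby=df["hobbies"].str.split(";")).explode("hobby")
flags_b = pd.crosstab(long_form["name"], long_form["hobby"])

print(flags_a)
print(flags_b)
```

Variant A keeps `name` as an ordinary column, while variant B returns it as the index, matching the shape of the `unstack` result.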


References:
Pandas split column into multiple columns by comma
Pandas DataFrame stack multiple column values into single column
