Split a pandas DataFrame column into multiple boolean columns

Question

I have a CSV with 10K rows of movie data.

In the "genre" column, the data looks like this:

Adventure|Science Fiction|Thriller
Action|Adventure|Science Fiction|Fantasy
Action|Crime|Thriller
Western|Drama|Adventure|Thriller

I want to create multiple sub-columns (i.e. action yes/no, adventure yes/no, drama yes/no, etc.) based on the genre column.

Question 1: How can I first determine all the unique genre titles in the genre column?

Question 2: After I determine all the unique genre titles, how do I create all the necessary ['insert genre' yes/no] columns?

Answer 1

Use str.get_dummies:

df = df['col'].str.get_dummies('|').replace({0:'no', 1:'yes'})

Or:

d = {0:'no', 1:'yes'}
df = df['col'].str.get_dummies('|').applymap(d.get)
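Both one-liners above overwrite df with the yes/no table, so any other columns in the CSV are lost. If those columns are still needed, a minimal sketch is to build the yes/no table separately and join it back on the index. The "title" column and its values are hypothetical, and np.where is used here as a swapped-in alternative to replace/applymap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],  # hypothetical extra column
    "col": [
        "Adventure|Science Fiction|Thriller",
        "Action|Adventure|Science Fiction|Fantasy",
        "Action|Crime|Thriller",
        "Western|Drama|Adventure|Thriller",
    ],
})

# One 0/1 indicator column per genre found in the '|'-separated strings.
dummies = df["col"].str.get_dummies("|")

# Turn the 0/1 indicators into 'yes'/'no' labels, keeping index and column names.
yes_no = pd.DataFrame(
    np.where(dummies == 1, "yes", "no"),
    index=dummies.index,
    columns=dummies.columns,
)

# Attach the new columns to the original frame instead of replacing it.
result = df.join(yes_no)
```

Because the join is on the index, each row keeps its original data next to its genre flags.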

For better performance use MultiLabelBinarizer:

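import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per genre in mlb.classes_;
# d is the {0: 'no', 1: 'yes'} mapping defined above
df = (pd.DataFrame(mlb.fit_transform(df['col'].str.split('|')),
                   columns=mlb.classes_,
                   index=df.index)
        .applymap(d.get))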

print (df)
  Action Adventure Crime Drama Fantasy Science Fiction Thriller Western
0     no       yes    no    no      no             yes      yes      no
1    yes       yes    no    no     yes             yes       no      no
2    yes        no   yes    no      no              no      yes      no
3     no       yes    no   yes      no              no      yes     yes

Detail:

print (df['col'].str.get_dummies('|'))
   Action  Adventure  Crime  Drama  Fantasy  Science Fiction  Thriller  \
0       0          1      0      0        0                1         1   
1       1          1      0      0        1                1         0   
2       1          0      1      0        0                0         1   
3       0          1      0      1        0                0         1   

   Western  
0        0  
1        0  
2        0  
3        1
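The column names of that indicator table also answer question 1, since there is one column per distinct genre. As a rough sketch, the unique titles can also be listed directly by splitting and exploding the column. This assumes pandas 0.25 or later for Series.explode, and the small frame below just reproduces the four sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "Adventure|Science Fiction|Thriller",
    "Action|Adventure|Science Fiction|Fantasy",
    "Action|Crime|Thriller",
    "Western|Drama|Adventure|Thriller",
]})

# Split each string into a list, put one genre per row, then keep distinct values.
unique_genres = sorted(df["col"].str.split("|").explode().unique())

# The indicator table from str.get_dummies carries the same labels as column names.
dummy_columns = list(df["col"].str.get_dummies("|").columns)
```

Either list can then be used to check the genre vocabulary before creating the yes/no columns.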

Timings:

df = pd.concat([df] * 10000, ignore_index=True)


In [361]: %timeit pd.DataFrame(mlb.fit_transform(df['col'].str.split('|')) ,columns=mlb.classes_,  index=df.index)
10 loops, best of 3: 120 ms per loop

In [362]: %timeit df['col'].str.get_dummies('|')
1 loop, best of 3: 324 ms per loop

In [363]: %timeit pd.get_dummies(df['col'].str.split('|').apply(pd.Series).stack()).sum(level=0)
1 loop, best of 3: 7.77 s per loop

Answer 2

Assuming your column is called Genres, this is one way.

# groupby(level=0).sum() replaces the older .sum(level=0), which pandas 2.0 removed
res = pd.get_dummies(df['Genres'].str.split('|').apply(pd.Series).stack()).groupby(level=0).sum()

#    Action  Adventure  Crime  Drama  Fantasy  ScienceFiction  Thriller  Western
# 0       0          1      0      0        0               1         1        0
# 1       1          1      0      0        1               1         0        0
# 2       1          0      1      0        0               0         1        0
# 3       0          1      0      1        0               0         1        1

You can then convert the binary values to "no" / "yes" via pd.DataFrame.applymap:

res = res.applymap({0: 'no', 1: 'yes'}.get)
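A variant of the same idea, sketched for more recent pandas releases: Series.explode (pandas 0.25+) can stand in for the apply(pd.Series).stack() step, and groupby(level=0).sum() is the spelling that pandas 2.x accepts for the per-row sum. The sample frame mirrors the question's four rows, and dtype=int is passed only so the result holds 0/1 rather than booleans:

```python
import pandas as pd

df = pd.DataFrame({"Genres": [
    "Adventure|Science Fiction|Thriller",
    "Action|Adventure|Science Fiction|Fantasy",
    "Action|Crime|Thriller",
    "Western|Drama|Adventure|Thriller",
]})

# One genre per row; the original row label is repeated in the index.
exploded = df["Genres"].str.split("|").explode()

# One indicator column per genre, then collapse back to one row per movie.
res = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
```

DataFrame.applymap is deprecated from pandas 2.1 in favour of DataFrame.map, so on those versions the yes/no conversion can be written as res.map({0: 'no', 1: 'yes'}.get).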
