简体   繁体   中英

Pandas split a column of unequal length lists into multiple boolean columns

Given a DataFrame df1 as follows:

df1 = pd.DataFrame({
    'col1': [1,2,3,4],
    'col2': [['a', 'b'], ['c'], ['a', 'd', 'b'], ['e']]
})

Which looks like:

    col1    col2
0   1       [a, b]
1   2       [c]
2   3       [a, d, b]
3   4       [e]

I want to convert col2 - a column where each cell is a list - into several columns ( a , b , c , d , e ), where the values are boolean entries defining whether that column name existed in the original list, in the given row.

The output should follow this form:

df2 = pd.DataFrame({
    'col1': [1,2,3,4],
    'a': [True, False, True, False],
    'b': [True, False, True, False],
    'c': [False, True, False, False],
    'd': [False, False, True, False],
    'e': [False, False, False, True]
    
})

Which looks like:

    col1    a       b       c       d       e
0   1       True    True    False   False   False
1   2       False   False   True    False   False
2   3       True    True    False   True    False
3   4       False   False   False   False   True

What's a clean way to do this?

Alternative approach using str.get_dummies probably more efficient than apply + pd.Series :

df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)

       a      b      c      d      e
0   True   True  False  False  False
1  False  False   True  False  False
2   True   True  False   True  False
3  False  False  False  False   True

Timings:

df1.shape
(40000, 2)

%%timeit
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
286 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
9.43 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One approach

df2 = df1.reset_index().explode('col2')
new_df = pd.crosstab(df2['col1'], df2['col2']).astype(bool)
print(new_df)

col2      a      b      c      d      e
col1                                   
1      True   True  False  False  False
2     False  False   True  False  False
3      True   True  False   True  False
4     False  False  False  False   True

Or

new_df = (df1.reset_index()
              .explode('col2')
              .groupby(['index', 'col2'])['col1']
              .any()
              .unstack(fill_value=False))

new_df = (df1.reset_index()
             .explode('col2')
             .pivot_table(index='index', 
                          values='col1', 
                          columns='col2', 
                          aggfunc='any', 
                         fill_value=False) 
         )

解决了。

pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM