Given a DataFrame df1
as follows:
df1 = pd.DataFrame({
'col1': [1,2,3,4],
'col2': [['a', 'b'], ['c'], ['a', 'd', 'b'], ['e']]
})
Which looks like:
col1 col2
0 1 [a, b]
1 2 [c]
2 3 [a, d, b]
3 4 [e]
I want to convert col2
- a column where each cell is a list - into several columns ( a
, b
, c
, d
, e
), where the values are boolean entries defining whether that column name existed in the original list, in the given row.
The output should follow this form:
df2 = pd.DataFrame({
'col1': [1,2,3,4],
'a': [True, False, True, False],
'b': [True, False, True, False],
'c': [False, True, False, False],
'd': [False, False, True, False],
'e': [False, False, False, True]
})
Which looks like:
col1 a b c d e
0 1 True True False False False
1 2 False False True False False
2 3 True True False True False
3 4 False False False False True
What's a clean way to do this?
Alternative approach using str.get_dummies
probably more efficient than apply
+ pd.Series
:
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
a b c d e
0 True True False False False
1 False False True False False
2 True True False True False
3 False False False False True
Timings:
df1.shape
(40000, 2)
%%timeit
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
286 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
9.43 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One approach
df2 = df1.reset_index().explode('col2')
new_df = pd.crosstab(df2['col1'], df2['col2']).astype(bool)
print(new_df)
col2 a b c d e
col1
1 True True False False False
2 False False True False False
3 True True False True False
4 False False False False True
Or
new_df = (df1.reset_index()
.explode('col2')
.groupby(['index', 'col2'])['col1']
.any()
.unstack(fill_value=False))
new_df = (df1.reset_index()
.explode('col2')
.pivot_table(index='index',
values='col1',
columns='col2',
aggfunc='any',
fill_value=False)
)
解决了。
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.