I have a pandas DataFrame like this:
   uid category  count
0    1   comedy      5
1    1    drama      7
2    2    drama      4
3    3    other     10
4    3   comedy      6
Except there are dozens of categories, millions of rows, and a few dozen other columns.
I want to turn that into something like this:
   id  cat_comedy  cat_drama  cat_other
0   1           5          7          0
1   2           0          4          0
2   3           6          0         10
I have no idea how to do this and am looking for tips/hints/full solutions. I don't really care about the row index.
Thanks.
I think this is what you're after (the operation is called a 'pivot'):
from pandas import DataFrame
df = DataFrame([
{'id': 1, 'category': 'comedy', 'count': 5},
{'id': 1, 'category': 'drama', 'count': 7},
{'id': 2, 'category': 'drama', 'count': 4},
{'id': 3, 'category': 'other', 'count': 10},
{'id': 3, 'category': 'comedy', 'count': 6}
]).set_index('id')
result = df.pivot(columns=['category'])
print(result)
Result:
          count
category comedy drama other
id
1           5.0   7.0   NaN
2           NaN   4.0   NaN
3           6.0   NaN  10.0
In response to your comment: if you don't want id as the index of df, you can tell the operation which column to use as the index for the pivot. You'll need pivot_table instead of pivot to achieve this, as it can handle duplicate values for one pivoted index/column pair. Replacing the NaN values with zeroes is also an option:
df = DataFrame([
{'uid': 1, 'category': 'comedy', 'count': 5},
{'uid': 1, 'category': 'drama', 'count': 7},
{'uid': 2, 'category': 'drama', 'count': 4},
{'uid': 3, 'category': 'other', 'count': 10},
{'uid': 3, 'category': 'comedy', 'count': 6}
])
result = df.pivot_table(columns=['category'], index='uid', fill_value=0)
print(result)
However, note that the resulting table still has uid as its index. If that's not what you want, you can turn the index back into a regular column with reset_index():
result = df.pivot_table(columns=['category'], index='uid', fill_value=0).reset_index()
The final result:
         uid  count
category     comedy drama other
0          1      5     7     0
1          2      0     4     0
2          3      6     0    10
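To illustrate the point about duplicate index/column pairs, here is a minimal sketch. The data is hypothetical (uid 1 deliberately has two 'comedy' rows), and summing the duplicates is an assumption about what you would want; another aggregation may suit your data better.

```python
import pandas as pd

# Hypothetical data: uid 1 appears twice with category 'comedy'.
df = pd.DataFrame({
    'uid': [1, 1, 2],
    'category': ['comedy', 'comedy', 'drama'],
    'count': [5, 3, 4],
})

# pivot cannot reshape when an index/column pair occurs more than once.
try:
    df.pivot(index='uid', columns='category', values='count')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here: by summing them).
table = df.pivot_table(
    index='uid',
    columns='category',
    values='count',
    aggfunc='sum',
    fill_value=0,
)
print(pivot_failed)
print(table)
```

If your real data never repeats a uid/category pair, the aggregation function makes no difference to the numbers.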
The original answer from @Grismar (upvoted since he got it in first) is really close but doesn't quite work. Don't reset your index before the pivot call, and then do the following:
df2 = df.pivot_table(columns='category', index='uid', aggfunc='sum')
df2 = df2.fillna(0).reset_index()
df2 is now the dataframe you want. The fillna function replaces all the NaNs with 0s.
Complete solution using pivot_table:
import pandas as pd
df = pd.DataFrame([
{'uid': 1, 'category': 'comedy', 'count': 5},
{'uid': 1, 'category': 'drama', 'count': 7},
{'uid': 2, 'category': 'drama', 'count': 4},
{'uid': 3, 'category': 'other', 'count': 10},
{'uid': 3, 'category': 'comedy', 'count': 6}
])
df.pivot_table(
    columns='category',
    index='uid',
    aggfunc='sum',  # string name; passing the builtin sum is deprecated in recent pandas
    fill_value=0
)
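The answers above leave a two-level column header (count / category), while the question asks for flat names like cat_comedy. One possible way to get there, as a sketch: pass values='count' so only that column is pivoted (which should also keep the "few dozen other columns" out of the result), then prefix the labels with add_prefix. The variable name wide and the choice of 'sum' as the aggregation are assumptions, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame([
    {'uid': 1, 'category': 'comedy', 'count': 5},
    {'uid': 1, 'category': 'drama', 'count': 7},
    {'uid': 2, 'category': 'drama', 'count': 4},
    {'uid': 3, 'category': 'other', 'count': 10},
    {'uid': 3, 'category': 'comedy', 'count': 6},
])

# Pivot only the 'count' column, so the result has a single-level header.
wide = df.pivot_table(
    index='uid',
    columns='category',
    values='count',
    aggfunc='sum',
    fill_value=0,
)

# Rename comedy -> cat_comedy etc., drop the 'category' header label,
# and turn uid back into an ordinary column.
wide = wide.add_prefix('cat_')
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

The resulting columns are uid, cat_comedy, cat_drama and cat_other, with one row per uid.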