Python: pivot a pandas DataFrame when the desired index Series has duplicates
I have a pandas DataFrame my_data that looks like:
event_id user_id attended
0 13 345 1
1 14 654 0
...
So event_id and user_id both have duplicates, because there is an entry for each user and event combination. What I want to do is reshape this into a DataFrame whose index (rows) is the distinct user_id values, whose columns are the distinct event_id values, and whose value at a given (row, column) is just the 0 or 1 indicating whether that user attended that event.
It seems that the pivot method is appropriate, but when I tried my_data.pivot(index='user_id', columns='event_id', values='attended') I got an error saying that the index has duplicates.
I was thinking I should do some kind of groupby on user_id first, but I don't want to add up all the attended 1s and 0s for each user, because I specifically want the event_id values as my new columns, keeping track of which event each user attended.
Any help would be greatly appreciated, thanks!
IIUC, pivot_table should give you what you want:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"event_id": np.random.randint(10, 20, 20), "user_id": np.random.randint(100, 110, 20), "attended": np.random.randint(0, 2, 20)})
>>> df.pivot_table(index="user_id", columns="event_id", values="attended",
...                aggfunc="sum").fillna(0)
event_id 10 11 12 13 14 15 16 17 19
user_id
101 0 0 0 1 0 0 0 0 0
103 0 0 0 0 0 0 0 0 0
104 0 0 0 0 0 0 0 0 1
105 0 0 0 0 0 0 0 0 0
106 0 0 0 0 0 0 1 0 0
107 1 0 0 0 0 0 0 1 0
108 0 0 0 1 0 0 0 0 0
109 0 0 0 0 1 0 1 0 0
As written, if there are multiple rows with the same user/event combination (which probably isn't the case), the attendance values will be summed. It's easy enough to use any or to clip the values instead if you want to guarantee that the frame consists only of 0s and 1s.
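A minimal sketch of those two options, assuming a small made-up frame (the rows below are hypothetical, not the asker's data). It uses aggfunc="max" as a stand-in for any, since the maximum of 0/1 values is 1 exactly when at least one row is 1, and it uses fill_value=0 for combinations that never occur. The duplicated check is included because pivot only raises its duplicate-entries error when a (user_id, event_id) pair repeats, so it may be worth confirming whether that happens in your data.

```python
import pandas as pd

# Hypothetical sample data: user 345 is recorded twice for event 13,
# and user 654 has no row at all for event 14.
my_data = pd.DataFrame({
    "event_id": [13, 13, 14, 13],
    "user_id": [345, 345, 345, 654],
    "attended": [1, 1, 0, 1],
})

# True if any (user_id, event_id) pair occurs more than once.
has_repeats = my_data.duplicated(["user_id", "event_id"]).any()

# Option 1: take the max per pair, so repeated pairs collapse to 0 or 1.
by_max = my_data.pivot_table(
    index="user_id", columns="event_id", values="attended",
    aggfunc="max", fill_value=0,
).astype(int)

# Option 2: sum per pair as in the answer, then cap every cell at 1.
by_clip = my_data.pivot_table(
    index="user_id", columns="event_id", values="attended",
    aggfunc="sum", fill_value=0,
).clip(upper=1).astype(int)

print(has_repeats)
print(by_max)
print(by_clip)
```

Both frames should come out identical here; which one reads better is mostly a matter of taste.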