Python: pivot a pandas DataFrame when the desired index Series has duplicates

I have a pandas DataFrame my_data that looks like this:

    event_id    user_id    attended
0     13          345         1
1     14          654         0
...

So event_id and user_id both contain duplicates, because there is one entry for each user and event combination. What I want to do is reshape this into a DataFrame whose index (rows) is the distinct user_id values, whose columns are the distinct event_id values, and whose value at a given (row, col) is just the 0 or 1 flag indicating whether that user attended.

It seems that the pivot method is appropriate, but when I tried my_data.pivot(index='user_id', columns='event_id', values='attended') I got an error saying that the index has duplicates.
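
For readers trying to reproduce this, here is a minimal sketch of when pivot fails. The sample rows are hypothetical (only the column names come from the question), and it assumes the error arises from a repeated (user_id, event_id) pair rather than from repeated user_id values alone, which pivot handles fine.

```python
import pandas as pd

# Hypothetical data: the pair (user 345, event 13) appears twice.
my_data = pd.DataFrame({
    "event_id": [13, 14, 13, 13],
    "user_id": [345, 345, 654, 345],
    "attended": [1, 0, 1, 1],
})

# With one row per (user_id, event_id) pair, pivot works even though
# user_id and event_id each repeat on their own.
deduped = my_data.drop_duplicates(subset=["user_id", "event_id"])
wide = deduped.pivot(index="user_id", columns="event_id", values="attended")

# With the repeated pair left in, pivot cannot decide which value to keep
# and raises ValueError.
try:
    my_data.pivot(index="user_id", columns="event_id", values="attended")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

print(wide)
print(pivot_error)
```

Combinations that never occur (here user 654 and event 14) come out as NaN in the pivoted frame.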

I was thinking I should first do some kind of groupby on user_id, but I don't want to add up all the attended 1s and 0s for each user: I specifically want the event_id values as my new columns, so that it stays clear which event each user attended.
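
One way the groupby idea could be made to work, as a sketch under assumptions rather than the approach taken in this thread: group on both user_id and event_id instead of user_id alone, so events stay separate, then unstack event_id into columns. The sample rows are hypothetical, and max is chosen here so that repeated pairs collapse to a 0/1 flag instead of being added up.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
my_data = pd.DataFrame({
    "event_id": [13, 14, 13, 13],
    "user_id": [345, 345, 654, 345],
    "attended": [1, 0, 1, 1],
})

attendance = (
    my_data.groupby(["user_id", "event_id"])["attended"]
    .max()                  # 1 if any row for the pair says attended
    .unstack(fill_value=0)  # event_id becomes columns; missing pairs get 0
)
print(attendance)
```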

Any help would be greatly appreciated, thanks!

IIUC, pivot_table should give you what you want:

>>> import numpy as np
>>> import pandas as pd
>>> # Random sample data, so the exact output varies from run to run.
>>> df = pd.DataFrame({"event_id": np.random.randint(10, 20, 20), "user_id": np.random.randint(100, 110, 20), "attended": np.random.randint(0, 2, 20)})
>>> df.pivot_table(index="user_id", columns="event_id", values="attended",
...     aggfunc="sum").fillna(0)
event_id  10  11  12  13  14  15  16  17  19
user_id                                     
101        0   0   0   1   0   0   0   0   0
103        0   0   0   0   0   0   0   0   0
104        0   0   0   0   0   0   0   0   1
105        0   0   0   0   0   0   0   0   0
106        0   0   0   0   0   0   1   0   0
107        1   0   0   0   0   0   0   1   0
108        0   0   0   1   0   0   0   0   0
109        0   0   0   0   1   0   1   0   0

As written, if several rows share the same user/event combination (which probably isn't the case), their attendance values are summed. If you want to guarantee that the frame contains only 0s and 1s, it's easy enough to aggregate with any or to clip the values instead.
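
A minimal sketch of that last point, on hypothetical data where one user/event pair is repeated. It shows the clip variant and, as an assumption of this sketch, uses aggfunc="max" as a stand-in for any, since on 0/1 values the maximum is 1 exactly when any row is 1.

```python
import pandas as pd

# Hypothetical data: user 101 has two rows for event 13.
df = pd.DataFrame({
    "event_id": [13, 13, 14, 16],
    "user_id": [101, 101, 101, 109],
    "attended": [1, 1, 0, 1],
})

# Summing counts the repeated pair twice.
summed = df.pivot_table(index="user_id", columns="event_id",
                        values="attended", aggfunc="sum", fill_value=0)

# Option 1: cap every cell at 1 after summing.
clipped = summed.clip(upper=1).astype(int)

# Option 2: aggregate with max, which acts like "any" on 0/1 values.
as_max = df.pivot_table(index="user_id", columns="event_id",
                        values="attended", aggfunc="max",
                        fill_value=0).astype(int)

print(summed)
print(clipped)
print(as_max)
```

Both options give the same frame here; clip is the smaller change to the original call, while choosing the aggregation up front avoids producing counts in the first place.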
