Pandas: how to get the rows that have the maximum value count on a column, grouped by another column, as a dataframe
I have three columns in a pandas dataframe: Date, Hour and Content. I want to get, for each day, the hour with the most content. I am using messages.groupby(["Date", "Hour"]).Content.count().groupby(level=0).tail(1), but I don't know what groupby(level=0) is doing here. It outputs the following:
Date Hour
2018-04-12 23 4
2018-04-13 21 43
2018-04-14 9 1
2018-04-15 23 29
2018-04-16 17 1
..
2020-04-23 20 1
2020-04-24 22 1
2020-04-25 20 1
2020-04-26 23 32
2020-04-27 23 3
This is a pandas Series, and the Date and Hour columns I want are its MultiIndex. If I try to convert the MultiIndex to a dataframe using pd.DataFrame(most_active.index), where most_active is the output of the previous code, it creates a dataframe of tuples, as below:
0
0 (2018-04-12, 23)
1 (2018-04-13, 21)
2 (2018-04-14, 9)
3 (2018-04-15, 23)
4 (2018-04-16, 17)
.. ...
701 (2020-04-23, 20)
702 (2020-04-24, 22)
703 (2020-04-25, 20)
704 (2020-04-26, 23)
705 (2020-04-27, 23)
But I need two separate columns, Date and Hour. What is the best way to do this?
Edit, because I misunderstood your question:
First, you have to count the total content by date-hour, just like you did:
df = messages.groupby(["Date", "Hour"], as_index=False).Content.count()
Here, I left the groups in their original columns by passing the parameter as_index=False.
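To make the effect of as_index=False concrete, here is a minimal sketch. The sample messages frame is hypothetical (invented for illustration, not the asker's data); it only shows that the group keys end up as ordinary columns rather than as a MultiIndex.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `messages` frame.
messages = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-12",
             "2018-04-13", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 23, 9, 21, 21],
    "Content": ["a", "b", "c", "d", "e", "f"],
})

# Default behaviour: the group keys become a MultiIndex on a Series.
as_series = messages.groupby(["Date", "Hour"]).Content.count()

# as_index=False: the group keys stay as regular columns of a DataFrame.
df = messages.groupby(["Date", "Hour"], as_index=False).Content.count()

print(as_series.index.names)  # index levels of the Series
print(df)                     # flat frame with Date, Hour and Content columns
```

In the flat frame, Content holds the number of messages per (Date, Hour) pair.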
Then, you can run the code below, provided in the original answer:
Supposing you have unique index IDs (if not, just do df.reset_index(inplace=True)), you can use the idxmax method in groupby. It returns the index label of the largest value in each group, which you can then use to select the matching rows from the dataframe.
For example:
df.loc[df.groupby('Date')['Content'].idxmax()]
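As a rough sketch of the idxmax approach, the example below groups on Date alone, so that each group holds one day's hourly counts and idxmax can pick the busiest hour. The counts are hypothetical values shaped like the output of the counting step.

```python
import pandas as pd

# Hypothetical per-hour counts, shaped like the output of the count step.
df = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 9, 21],
    "Content": [1, 4, 2, 43],
})

# Index label of the row with the largest count within each date.
busiest_idx = df.groupby("Date")["Content"].idxmax()

# Select those rows; Date and Hour remain two separate columns.
most_active = df.loc[busiest_idx]
print(most_active)
```

If two hours of the same day tie for the maximum, idxmax returns the first one it encounters.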
As an alternative (without using groupby), you can first sort the values in descending order, then remove the duplicate dates so that only the largest row per date remains:
df.sort_values('Content', ascending=False).drop_duplicates(subset='Date')
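The sort-then-deduplicate alternative can be sketched the same way. This assumes one row per date is wanted and again uses hypothetical counts; the final sort by Date and the reset_index call are optional tidying, not part of the original answer.

```python
import pandas as pd

# Hypothetical per-hour counts, shaped like the output of the count step.
df = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 9, 21],
    "Content": [1, 4, 2, 43],
})

most_active = (
    df.sort_values("Content", ascending=False)  # largest counts first
      .drop_duplicates(subset="Date")           # keep the first (largest) row per date
      .sort_values("Date")                      # restore chronological order
      .reset_index(drop=True)                   # tidy 0..n-1 index
)
print(most_active)
```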
Finally, if you want Date and Hour back as a MultiIndex, use the set_index() method:
df.set_index(['Date','Hour'])
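On the question's original snippet: groupby(level=0) regroups the counted Series by its first index level (Date), and tail(1) keeps the last row of each group, which is the latest hour of the day rather than the busiest one. The sketch below, on hypothetical data, contrasts that with a per-date idxmax and shows that Series.reset_index turns the index levels into columns. The column name Count is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's `messages` frame.
messages = pd.DataFrame({
    "Date": ["2018-04-12"] * 4 + ["2018-04-13"] * 3,
    "Hour": [21, 21, 21, 23, 9, 9, 21],
    "Content": list("abcdefg"),
})

counts = messages.groupby(["Date", "Hour"]).Content.count()

# Regroup by the first index level (Date) and keep the last row per date.
# This yields the latest hour of each day, not the hour with the most content.
last_hour = counts.groupby(level=0).tail(1)

# Move the index levels into columns, then pick the largest count per date.
flat = counts.reset_index(name="Count")
result = flat.loc[flat.groupby("Date")["Count"].idxmax()]

print(last_hour)
print(result)
```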