Pandas: how to get the rows with the maximum value count in one column, grouped by another column, as a dataframe

Question

I have three columns in a pandas dataframe: Date, Hour and Content. I want to get the hour in each day that has the most content for that day. I am using messages.groupby(["Date", "Hour"]).Content.count().groupby(level=0).tail(1). I don't know what groupby(level=0) is doing here. The output is as follows:

Date        Hour
2018-04-12  23       4
2018-04-13  21      43
2018-04-14  9        1
2018-04-15  23      29
2018-04-16  17       1
                    ..
2020-04-23  20       1
2020-04-24  22       1
2020-04-25  20       1
2020-04-26  23      32
2020-04-27  23       3

This is a pandas Series object, and my desired Date and Hour columns are a MultiIndex here. If I try to convert the MultiIndex object to a dataframe using pd.DataFrame(most_active.index), where most_active is the output of the previous code, it creates a dataframe of tuples, as below:

                    0
0    (2018-04-12, 23)
1    (2018-04-13, 21)
2     (2018-04-14, 9)
3    (2018-04-15, 23)
4    (2018-04-16, 17)
..                ...
701  (2020-04-23, 20)
702  (2020-04-24, 22)
703  (2020-04-25, 20)
704  (2020-04-26, 23)
705  (2020-04-27, 23)

But I need two separate columns, Date and Hour. What is the best way to do this?

Answer 1

Edited because I misunderstood your question.

First, you have to count the total content by date-hour, just like you did:

df = messages.groupby(["Date", "Hour"], as_index=False).Content.count()

Here, I left the groups in their original columns by passing the parameter as_index=False.
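To make the effect of as_index=False concrete, here is a minimal sketch on made-up data (only the column names Date, Hour and Content come from the question; the values are hypothetical). It also shows what the groupby(level=0).tail(1) step from the question does: it regroups the counts by the first index level (Date) and keeps the last row of each group, which is the latest hour of the day and not necessarily the busiest one.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
messages = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-12",
             "2018-04-13", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 23, 9, 9, 21],
    "Content": ["a", "b", "c", "d", "e", "f"],
})

# as_index=False keeps Date and Hour as ordinary columns of a DataFrame.
counts = messages.groupby(["Date", "Hour"], as_index=False)["Content"].count()
print(counts)

# Default behaviour: the group keys become a MultiIndex on a Series.
counts_series = messages.groupby(["Date", "Hour"])["Content"].count()
print(counts_series.index.names)

# groupby(level=0) groups by the first index level (Date);
# tail(1) keeps the last row of each group, i.e. the latest hour.
last_per_day = counts_series.groupby(level=0).tail(1)
print(last_per_day)
```

In this sample, 2018-04-13 has two messages at hour 9 and one at hour 21, so tail(1) returns hour 21 even though hour 9 is the busiest.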

Then, you can run the code below, provided in the original answer:

Supposing you have unique index IDs (if not, just do df.reset_index(inplace=True)), you can use the idxmax method in groupby. It returns the index label of the largest value in each group, and you can then use those labels to slice the dataframe.

For example, grouping by Date alone so that each day yields the row of its busiest hour (after the count above, every Date-Hour pair is already a single row, so grouping by both columns would return every row unchanged):

df.loc[df.groupby('Date')['Content'].idxmax()]
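As a rough illustration of the idxmax approach, the sketch below uses hypothetical per-hour counts and assumes the goal stated in the question: one row per day, so the grouping key is Date alone. When two hours tie, idxmax returns the first occurrence.

```python
import pandas as pd

# Hypothetical per-hour message counts (one row per Date-Hour pair).
df = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 9, 21],
    "Content": [1, 4, 2, 43],
})

# For each Date, find the index label of the row with the largest count.
best_rows = df.groupby("Date")["Content"].idxmax()

# Use those labels to select the full rows, keeping Date and Hour as columns.
busiest = df.loc[best_rows]
print(busiest)
```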

As an alternative (without using groupby), you can first sort the values in descending order, then remove the duplicate dates so that only the top row for each Date remains:

df.sort_values('Content', ascending=False).drop_duplicates(subset='Date')
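The sort-then-deduplicate idea can be sketched as follows, again on hypothetical counts and assuming one row per day is wanted (duplicates are dropped on Date only). The final sort_values and reset_index calls are optional tidying added for this example.

```python
import pandas as pd

# Hypothetical per-hour message counts (one row per Date-Hour pair).
df = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-12", "2018-04-13", "2018-04-13"],
    "Hour": [22, 23, 9, 21],
    "Content": [1, 4, 2, 43],
})

busiest = (
    df.sort_values("Content", ascending=False)  # largest counts first
      .drop_duplicates(subset="Date")           # keep the first (largest) row per Date
      .sort_values("Date")                      # restore chronological order
      .reset_index(drop=True)                   # tidy 0..n-1 index
)
print(busiest)
```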

Finally, you can get a MultiIndex with the set_index() method:

df.set_index(['Date','Hour'])
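For the reverse direction raised in the question (turning a Date/Hour MultiIndex into two separate columns rather than a column of tuples), here is a short sketch on hypothetical data. The name most_active follows the question; reset_index and MultiIndex.to_frame are standard pandas methods, but whether they fit your real data is an assumption.

```python
import pandas as pd

# Hypothetical result: the busiest hour per day, with its message count.
df = pd.DataFrame({
    "Date": ["2018-04-12", "2018-04-13"],
    "Hour": [23, 21],
    "Content": [4, 43],
})

# set_index returns a new DataFrame whose index is a (Date, Hour) MultiIndex.
indexed = df.set_index(["Date", "Hour"])

# A Series with a MultiIndex, shaped like the output in the question.
most_active = indexed["Content"]

# Option 1: move every index level back into columns.
as_columns = most_active.reset_index()

# Option 2: keep only the index levels, as a two-column DataFrame.
index_only = most_active.index.to_frame(index=False)

print(as_columns)
print(index_only)
```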

Solution 1
0 Accepted 2020-10-16 23:17:02