Pandas: group by two text columns and return the max rows based on counts

Question

I'm trying to find, for each First_Word, the (First_Word, Group) pair with the highest count.

import pandas as pd

df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
           'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
           'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
                'apple fell out of the tree', 'partrige in a pear tree']},
          columns=['First_Word', 'Group', 'Text'])

  First_Word         Group                        Text
0      apple    apple bins     where to buy apple bins
1      apple   apple trees         i see an apple tree
2     orange  orange juice         i like orange juice
3      apple   apple trees  apple fell out of the tree
4       pear     pear tree     partrige in a pear tree

Then I do a groupby:

grouped = df.groupby(['First_Word', 'Group']).count()
                         Text
First_Word Group             
apple      apple bins       1
           apple trees      2
orange     orange juice     1
pear       pear tree        1

I now want to filter it down so that each First_Word keeps only the row with the maximum Text count. Below, you'll notice apple bins was removed because apple trees has the higher count for apple.

                         Text
First_Word Group             
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

This is similar to the "max value of group" question, but when I try something like this:

df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()])

I get an error: KeyError: ('Text', 'occurred at index Text'). If I add axis=1 to the apply, I get IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)').

Answer 1

Given grouped, you now want to group by the First_Word index level and find the index label of the maximum row in each group (using idxmax):

In [39]: grouped.groupby(level='First_Word')['Text'].idxmax()
Out[39]: 
First_Word
apple       (apple, apple trees)
orange    (orange, orange juice)
pear           (pear, pear tree)
Name: Text, dtype: object
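To make the shape of that result concrete, here is a minimal sketch (the variable name idx is only illustrative) showing that each value returned by idxmax is a full (First_Word, Group) label tuple, which can be passed back to grouped.loc as a row key:

```python
import pandas as pd

df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

grouped = df.groupby(['First_Word', 'Group']).count()

# One label per First_Word; each label is a (First_Word, Group) tuple
# because grouped has a two-level MultiIndex.
idx = grouped.groupby(level='First_Word')['Text'].idxmax()

# A single tuple selects a single row of grouped.
apple_label = idx['apple']
apple_count = grouped.loc[apple_label, 'Text']
print(apple_label, apple_count)
```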

You can then use grouped.loc to select rows from grouped by index label:

import pandas as pd
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

grouped = df.groupby(['First_Word', 'Group']).count()
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()]
print(result)

yields

                         Text
First_Word Group             
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

Answer 1 | 2 | Accepted | 2016-06-09 21:52:25
