Pandas DataFrame Groupby two columns and get counts

I have a pandas DataFrame in the following format:

import pandas as pd

df = pd.DataFrame([
    [1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], 
    list('AAABBBBABCBDDD'), 
    [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], 
    ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
    ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']
]).T
df.columns = ['col1','col2','col3','col4','col5']

df:

   col1 col2 col3     col4 col5
0   1.1    A  1.1    x/y/z    1
1   1.1    A  1.7      x/y    3
2   1.1    A  2.5  x/y/z/n    3
3   2.6    B  2.6      x/u    2
4   2.5    B  3.3        x    4
5   3.4    B  3.8    x/u/v    2
6   2.6    B    4    x/y/z    5
7   2.6    A  4.2        x    3
8   3.4    B  4.3  x/u/v/b    6
9   3.4    C  4.5        -    3
10  2.6    B  4.6      x/y    5
11  1.1    D  4.7    x/y/z    1
12  1.1    D  4.7        x    1
13  3.3    D  4.8  x/u/v/w    1

I want to get the number of rows for each combination of col5 and col2, like the following. Expected output:

col5 col2 count
1    A      1
     D      3
2    B      2
etc...

How can I get my expected output? I also want to find the largest count for each 'col2' value.

You are looking for size:

In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64
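
If a flat table with a named count column is preferred (as in the expected output), one possible sketch is to follow size with reset_index. The column name "count" is an assumption taken from the question's expected output, and only the two grouping columns of the question's frame are rebuilt here:

```python
import pandas as pd

# Only the two grouping columns from the question's frame are needed here.
df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# size() returns a Series indexed by (col5, col2); reset_index(name=...)
# turns it into a DataFrame with columns col5, col2 and count.
counts = df.groupby(["col5", "col2"]).size().reset_index(name="count")
print(counts)
```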

To get the same answer as waitingkuo (the "second question"), but slightly cleaner, group by the index level:

In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A       3
B       2
C       1
D       3
dtype: int64

Following on from @Andy's answer, you can do the following to solve your second question:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]: 
      0
col2   
A     3
B     2
C     1
D     3
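
The two ways of answering the second question can be compared side by side. This is a sketch under two assumptions: the index level is referred to by its name ("col2") rather than its position, and the count column is named "count" so the result does not carry the default column label 0:

```python
import pandas as pd

df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

sizes = df.groupby(["col5", "col2"]).size()

# Group the counts by the col2 level of the MultiIndex.
by_level = sizes.groupby(level="col2").max()

# Flatten the index first, then group by the col2 column.
by_column = sizes.reset_index(name="count").groupby("col2")["count"].max()

print(by_level)
print(by_column)
```

Both results hold the same maxima per col2 value; they differ only in the name attached to the Series.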

Inserting data into a pandas dataframe and providing column names.

import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['Alphabet','Words']   # a flat list; a nested list would create MultiIndex columns
print(df)   #printing dataframe.

This is our printed data:

[image: printed dataframe]

To group the dataframe and count each group, you need to provide one more column that counts the grouping. Let's call that column "COUNTER".

Like this:

df['COUNTER'] =1       #initially, set that counter to 1.
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum() #sum function
print(group_data)

OUTPUT:

[image: grouped counts]
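
As a rough check of the counter-column idea, the sketch below rebuilds the same data and compares the summed COUNTER column with groupby size. The use of assign (so the original frame is left untouched) is an assumption for illustration, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    "Alphabet": list("ACABCABBAA"),
    "Words": ["ONE", "TWO", "ONE", "ONE", "ONE", "TWO", "ONE", "TWO", "ONE", "THREE"],
})

# Summing a column of ones per group counts the rows in each group.
with_counter = df.assign(COUNTER=1).groupby(["Alphabet", "Words"])["COUNTER"].sum()

# size() gives the same numbers without the helper column.
with_size = df.groupby(["Alphabet", "Words"]).size()

print(with_counter)
print(with_size)
```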

Idiomatic solution that uses only a single groupby

(df.groupby(['col5', 'col2']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='col2'))

  col5 col2  count
0    3    A      3
1    1    D      3
2    5    B      2
6    3    C      1

Explanation

The result of the groupby size method is a Series with col5 and col2 in the index. From here, you could use another groupby to find the maximum count for each value of col2, but that is not necessary. You can simply sort all the values in descending order and then keep only the first row for each col2 value with the drop_duplicates method.
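
The claim that sorting plus drop_duplicates yields the same maxima as a second groupby can be checked with a short sketch. The kind="stable" argument is an assumption added here so that ties keep a reproducible order; it is not part of the snippet above:

```python
import pandas as pd

df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# Single groupby: sort counts descending, keep the first row seen per col2.
top = (
    df.groupby(["col5", "col2"]).size()
      .sort_values(ascending=False, kind="stable")  # assumption: stable sort for ties
      .reset_index(name="count")
      .drop_duplicates(subset="col2")
)

# Double groupby, for comparison.
double = df.groupby(["col5", "col2"]).size().groupby(level="col2").max()

same = top.set_index("col2")["count"].sort_index().tolist() == double.tolist()
print(top)
print(same)
```

Unlike the double groupby, the single-groupby result also keeps the col5 value at which each maximum occurs.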

Should you want to add a new column (say 'count_column') containing the groups' counts to the dataframe:

df['count_column'] = df.groupby(['col5','col2'])['col5'].transform('count')

(I picked 'col5' as it contains no NaN)
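
A minimal sketch of the transform approach, assuming bracket assignment is used to create the column (attribute-style assignment such as df.count_column = ... sets an attribute on the object rather than adding a column):

```python
import pandas as pd

df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# transform broadcasts each group's count back onto the rows of that group,
# so the result has the same length as the original frame.
df["count_column"] = df.groupby(["col5", "col2"])["col5"].transform("count")
print(df)
```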

Since pandas 1.1.0, you can call value_counts on a DataFrame:

out = df[['col5','col2']].value_counts().sort_index()

Output:

col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64
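
For comparison, here is a small sketch (assuming pandas 1.1.0 or later) showing that value_counts on the two columns gives the same numbers as groupby size once the index is sorted:

```python
import pandas as pd

df = pd.DataFrame({
    "col2": list("AAABBBBABCBDDD"),
    "col5": ["1", "3", "3", "2", "4", "2", "5", "3", "6", "3", "5", "1", "1", "1"],
})

# value_counts sorts by frequency by default; sort_index restores the
# (col5, col2) ordering that groupby produces.
vc = df[["col5", "col2"]].value_counts().sort_index()
gs = df.groupby(["col5", "col2"]).size()

print(vc.tolist() == gs.tolist())
```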

If you want a DataFrame as the final result (not a pandas Series), use the as_index= parameter:

df.groupby(['col5', 'col2'], as_index=False).size()

To get the final desired output, pivot_table may be used as well (instead of a double groupby):

df.pivot_table(index='col5', columns='col2', aggfunc='size').max()

If you don't want to count NaN values, you can use groupby.count:

df.groupby(['col5', 'col2']).count()

Note that since each column may have a different number of non-NaN values, a plain groupby.count call may return different counts for each column unless you specify the column, as in the example above. For example, the number of non-NaN values in col1 after grouping by ['col5', 'col2'] is as follows:

df.groupby(['col5', 'col2'])['col1'].count()
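
The difference between size and count only shows up when a column contains NaN, which the question's data does not. The sketch below therefore uses a small hypothetical frame (the values are made up for illustration) with one missing entry in col1:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in col1 inside the ('1', 'D') group.
df = pd.DataFrame({
    "col5": ["1", "1", "1", "2"],
    "col2": ["D", "D", "D", "B"],
    "col1": [1.1, np.nan, 3.3, 2.6],
})

grouped = df.groupby(["col5", "col2"])

sizes = grouped.size()              # counts rows, NaN included
non_null = grouped["col1"].count()  # counts only non-NaN values in col1

print(sizes)
print(non_null)
```

Here size reports 3 rows for the ('1', 'D') group, while count on col1 reports 2.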

You can simply call the built-in count function after the groupby function:

df.groupby(['col5','col2']).count()
