
Pandas groupby and transform based on multiple columns

I have seen a lot of similar questions, but none seem to work for my case. I'm pretty sure this is just a groupby transform, but I keep getting a KeyError along with axis issues. I am trying to group by filename and count the rows where pred != gt.

For example, index 2 is the only mismatch for f1.wav, so its count is 1, and indices (13, 14, 18) are the mismatches for f2.wav, so its count is 3.

df = pd.DataFrame([{'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f2.wav'}])
    pred  gt filename
0      0   0   f1.wav
1      0   0   f1.wav
2      2   0   f1.wav
3      0   0   f1.wav
4      0   0   f1.wav
5      0   0   f1.wav
6      0   0   f1.wav
7      0   0   f1.wav
8      0   0   f1.wav
9      0   0   f1.wav
10     0   0   f2.wav

Expected output

    pred  gt filename  counts
0      0   0   f1.wav       1
1      0   0   f1.wav       1
2      2   0   f1.wav       1
3      0   0   f1.wav       1
4      0   0   f1.wav       1
5      0   0   f1.wav       1
6      0   0   f1.wav       1
7      0   0   f1.wav       1
8      0   0   f1.wav       1
9      0   0   f1.wav       1
10     0   0   f2.wav       3
11     0   0   f2.wav       3
12     2   2   f2.wav       3
13     0   2   f2.wav       3
14     0   2   f2.wav       3
15     0   0   f2.wav       3
16     0   0   f2.wav       3
17     2   2   f2.wav       3
18     0   2   f2.wav       3
19     2   0   f2.wav       3

I was thinking df.groupby('filename').transform(lambda x: x['pred'].ne(x['gt']).sum(), axis=1), but I get TypeError: Transform function invalid for data types

.transform operates on each column individually, so you won't be able to access both 'pred' and 'gt' in a transform operation.
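One way to see the per-column behaviour is to run a built-in aggregation through .transform and look at the result. This is a minimal sketch on a small made-up frame (not the data from the question): each column is summed within its own group, and nothing in the call lets one column see the other.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, same column names as the question).
df = pd.DataFrame({
    "pred": [0, 2, 0, 0, 2, 2],
    "gt": [0, 0, 0, 2, 2, 0],
    "filename": ["f1.wav", "f1.wav", "f2.wav", "f2.wav", "f2.wav", "f2.wav"],
})

# 'pred' and 'gt' are each summed per filename, independently of one another,
# and the result is broadcast back to the original row index.
per_column = df.groupby("filename").transform("sum")
print(per_column)
```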

This leaves you with 2 options:

  1. aggregate and reindex or join back to the original shape
  2. pre-compute the boolean array and .transform on that
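Approach 1 can be sketched roughly as follows, on a small made-up frame rather than the question's data. The names per_file and result are only illustrative. The aggregation sees both columns because it works on each group's sub-frame, and the per-file totals are then joined back onto the rows.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, same column names as the question).
df = pd.DataFrame({
    "pred": [0, 2, 0, 0, 2, 2],
    "gt": [0, 0, 0, 2, 2, 0],
    "filename": ["f1.wav", "f1.wav", "f2.wav", "f2.wav", "f2.wav", "f2.wav"],
})

# Aggregate: one mismatch count per filename. Selecting the two value columns
# first means the lambda receives a sub-frame that holds both of them.
per_file = (
    df.groupby("filename")[["pred", "gt"]]
    .apply(lambda g: g["pred"].ne(g["gt"]).sum())
)

# Join the per-file totals back onto the original rows.
result = df.join(per_file.rename("counts"), on="filename")
print(result)
```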

Approach 2 will probably be the fastest here:

df['counts'] = (
    (df['pred'] != df['gt'])
    .groupby(df['filename']).transform('sum')
)

print(df)
    pred  gt filename  counts
0      0   0   f1.wav       1
1      0   0   f1.wav       1
2      2   0   f1.wav       1
3      0   0   f1.wav       1
4      0   0   f1.wav       1
5      0   0   f1.wav       1
6      0   0   f1.wav       1
7      0   0   f1.wav       1
8      0   0   f1.wav       1
9      0   0   f1.wav       1
10     0   0   f2.wav       4
11     0   0   f2.wav       4
12     2   2   f2.wav       4
13     0   2   f2.wav       4
14     0   2   f2.wav       4
15     0   0   f2.wav       4
16     0   0   f2.wav       4
17     2   2   f2.wav       4
18     0   2   f2.wav       4
19     2   0   f2.wav       4

Note that f2.wav has 4 instances where 'pred' != 'gt' (index 13, 14, 18, 19), not the 3 listed in the question.
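The count of 4 can be checked by listing the mismatching row labels per file. The sketch below rebuilds the question's data in a more compact form; the names mismatched and indices_by_file are only illustrative.

```python
import pandas as pd

# Same values as the DataFrame in the question, written column-wise.
pred = [0, 0, 2, 0, 0, 0, 0, 0, 0, 0] + [0, 0, 2, 0, 0, 0, 0, 2, 0, 2]
gt = [0] * 10 + [0, 0, 2, 2, 2, 0, 0, 2, 2, 0]
filename = ["f1.wav"] * 10 + ["f2.wav"] * 10
df = pd.DataFrame({"pred": pred, "gt": gt, "filename": filename})

# Keep only the rows where prediction and ground truth differ,
# then collect their index labels for each file.
mismatched = df.loc[df["pred"] != df["gt"]]
indices_by_file = {
    name: group.index.tolist() for name, group in mismatched.groupby("filename")
}
print(indices_by_file)
```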

Considering that df is the dataframe OP shares in the question, in order to group by filename and select the rows where pred != gt, one can use pandas.DataFrame.groupby and the groupby object's apply method as follows

df2 = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']])

[Out]:
             pred  gt filename
filename                      
f1.wav   2      2   0   f1.wav
f2.wav   13     0   2   f2.wav
         14     0   2   f2.wav
         18     0   2   f2.wav
         19     2   0   f2.wav

Assuming one wants to count the number of occurrences for each filename: after the previous operation, filename is both an index level and a column label, which is ambiguous, so one has to group by level (one of the parameters groupby accepts). To get a column named count that numbers each item within its group, one can then use pandas.core.groupby.GroupBy.cumcount. (Note: as opposed to the accepted answer, this approach counts sequentially within each group instead of repeating the group total on every row.)

df2['count'] = df2.groupby(level=0).cumcount() + 1 # The +1 is to make the count start at 1 instead of 0.

[Out]:
             pred  gt filename  count
filename                             
f1.wav   2      2   0   f1.wav      1
f2.wav   13     0   2   f2.wav      1
         14     0   2   f2.wav      2
         18     0   2   f2.wav      3
         19     2   0   f2.wav      4
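To make the difference between a sequential count and a per-group total concrete, the sketch below computes both on the mismatching rows. It filters with a boolean mask instead of groupby.apply, partly because, as far as I know, the way apply treats the grouping column has changed across pandas versions. The names running and total are only illustrative.

```python
import pandas as pd

# Same values as the DataFrame in the question, written column-wise.
pred = [0, 0, 2, 0, 0, 0, 0, 0, 0, 0] + [0, 0, 2, 0, 0, 0, 0, 2, 0, 2]
gt = [0] * 10 + [0, 0, 2, 2, 2, 0, 0, 2, 2, 0]
filename = ["f1.wav"] * 10 + ["f2.wav"] * 10
df = pd.DataFrame({"pred": pred, "gt": gt, "filename": filename})

mismatched = df[df["pred"] != df["gt"]]

# Sequential position of each mismatch within its file (1, 2, 3, ...).
running = mismatched.groupby("filename").cumcount() + 1

# Total number of mismatches in the file, repeated on every row of that file.
total = mismatched.groupby("filename")["pred"].transform("size")

print(pd.DataFrame({"running": running, "total": total}))
```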

A one-liner would look like the following

df2['count'] = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']]).groupby(level=0).cumcount() + 1

[Out]:
             pred  gt filename  count
filename                             
f1.wav   2      2   0   f1.wav      1
f2.wav   13     0   2   f2.wav      1
         14     0   2   f2.wav      2
         18     0   2   f2.wav      3
         19     2   0   f2.wav      4

If having the count in a separate column is not a requirement, and taking df2 as the dataframe after the first operation in this answer (when df2 was created), one can simply use the following, which gives a higher-level overview

df3 = df2.groupby(level=0).count().iloc[:, 0]

[Out]:
filename
f1.wav    1
f2.wav    4
Name: pred, dtype: int64
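One caveat with counting after filtering is that a file with no mismatches disappears from the summary. A minimal sketch on made-up data (the file names a.wav, b.wav and c.wav are hypothetical) compares it with summing the boolean mask per file, which keeps such files with a count of 0.

```python
import pandas as pd

# Hypothetical data: c.wav has no mismatches at all.
df = pd.DataFrame({
    "pred": [0, 2, 0, 0],
    "gt": [0, 0, 2, 0],
    "filename": ["a.wav", "a.wav", "b.wav", "c.wav"],
})

# Filter first, then count: files without mismatches are dropped.
filtered_counts = df[df["pred"] != df["gt"]].groupby("filename").size()

# Sum the boolean mask per file: every file is kept, including zeros.
bool_sum_counts = df["pred"].ne(df["gt"]).groupby(df["filename"]).sum()

print(filtered_counts)
print(bool_sum_counts)
```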
