Pandas groupby and transform based on multiple columns
I have seen a lot of similar questions, but none seem to work for my case. I'm pretty sure this is just a groupby transform, but I keep getting KeyError along with axis issues. I am trying to group by filename and count the rows where pred != gt.
For example, index 2 is the only mismatch for f1.wav, so its count is 1, and indices 13, 14 and 18 mismatch for f2.wav, so its count is 3.
import pandas as pd
df = pd.DataFrame([{'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f2.wav'}])
pred gt filename
0 0 0 f1.wav
1 0 0 f1.wav
2 2 0 f1.wav
3 0 0 f1.wav
4 0 0 f1.wav
5 0 0 f1.wav
6 0 0 f1.wav
7 0 0 f1.wav
8 0 0 f1.wav
9 0 0 f1.wav
10 0 0 f2.wav
Expected output
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 3
11 0 0 f2.wav 3
12 2 2 f2.wav 3
13 0 2 f2.wav 3
14 0 2 f2.wav 3
15 0 0 f2.wav 3
16 0 0 f2.wav 3
17 2 2 f2.wav 3
18 0 2 f2.wav 3
19 2 0 f2.wav 3
I was thinking of df.groupby('filename').transform(lambda x: x['pred'].ne(x['gt']).sum(), axis=1), but I get TypeError: Transform function invalid for data types.
.transform operates on each column individually, so you won't be able to access both 'pred' and 'gt' in a transform operation.
This leaves you with 2 options:
1. aggregate per group and join the result back to the original shape
2. pre-calculate the boolean array and .transform on that
Approach 2 will probably be the fastest here:
df['counts'] = (
(df['pred'] != df['gt'])
.groupby(df['filename']).transform('sum')
)
print(df)
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 4
11 0 0 f2.wav 4
12 2 2 f2.wav 4
13 0 2 f2.wav 4
14 0 2 f2.wav 4
15 0 0 f2.wav 4
16 0 0 f2.wav 4
17 2 2 f2.wav 4
18 0 2 f2.wav 4
19 2 0 f2.wav 4
Note that f2.wav has 4 instances where 'pred' != 'gt' (index 13, 14, 18, 19), rather than the 3 listed in the question.
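For comparison, the other route (aggregate once per filename, then broadcast the totals back onto the rows) might look like the sketch below. The `toy` frame and the names `mismatch` and `per_file` are hypothetical and chosen only for illustration; this is one possible way to do it, not necessarily what the answer's author had in mind.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({
    'pred': [0, 2, 0, 0, 2],
    'gt': [0, 0, 2, 2, 2],
    'filename': ['a.wav', 'a.wav', 'b.wav', 'b.wav', 'b.wav'],
})

# Step 1: one mismatch count per filename (an aggregation).
mismatch = toy['pred'] != toy['gt']
per_file = mismatch.groupby(toy['filename']).sum()

# Step 2: broadcast the per-file totals back onto every row.
toy['counts'] = toy['filename'].map(per_file)
print(toy)
```

The intermediate `per_file` Series is handy if the per-file totals are also needed on their own; otherwise the single transform call does the same job in one step.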
Considering that df is the dataframe OP shares in the question, in order to group by filename and count the rows where pred != gt, one can use pandas.DataFrame.groupby and pandas.core.groupby.DataFrameGroupBy.apply as follows
df2 = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']])
[Out]:
pred gt filename
filename
f1.wav 2 2 0 f1.wav
f2.wav 13 0 2 f2.wav
14 0 2 f2.wav
18 0 2 f2.wav
19 2 0 f2.wav
Assuming one wants to count the number of occurrences for each filename: after the previous operation, filename is both an index level and a column label, which is ambiguous. Considering that OP wants a column named count that numbers each item in each group, one has to group by level (one of the various parameters groupby accepts) and, finally, use pandas.core.groupby.GroupBy.cumcount.
(Note: As opposed to the accepted answer, this approach counts sequentially.)
df2['count'] = df2.groupby(level=0).cumcount() + 1 # The +1 is to make the count start at 1 instead of 0.
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
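A possible variation, sketched here on hypothetical toy data, selects the mismatching rows with a boolean mask instead of apply. Under that assumption filename stays an ordinary column, so grouping by name is unambiguous. The `total` column is added only to contrast the sequential count with a per-file total; neither `wrong` nor `total` appears in the original answers.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
toy = pd.DataFrame({
    'pred': [0, 2, 0, 0, 2],
    'gt': [0, 0, 2, 2, 2],
    'filename': ['a.wav', 'a.wav', 'b.wav', 'b.wav', 'b.wav'],
})

# Keep only mismatching rows; filename remains a plain column.
wrong = toy[toy['pred'] != toy['gt']].copy()

# Running number of each mismatch within its file (1, 2, 3, ...).
wrong['count'] = wrong.groupby('filename').cumcount() + 1

# Total mismatches per file, repeated on every mismatching row.
wrong['total'] = wrong.groupby('filename')['pred'].transform('size')
print(wrong)
```

The last value of `count` within each file equals that file's `total`, which is the number the question asks for.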
A one-liner would look like the following
df2 = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']]).assign(count=lambda d: d.groupby(level=0).cumcount() + 1)
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
If having the count in a separate column is not a requirement, then, taking df2 as the dataframe after the first operation mentioned in this answer (when df2 was created), one can simply use the following, which gives a more high-level overview
df3 = df2.groupby(level=0).count().iloc[:, 0]
[Out]:
filename
f1.wav 1
f2.wav 4
Name: pred, dtype: int64