
Groupby based on multiple logical conditions applied to different columns of a DataFrame

I have this dataframe:

import pandas as pd

df = pd.DataFrame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
                   'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
                   'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

    value    type  entity
0       1     big       R
1       2   small       R
2       3  medium       R
3       4     big       P
4       2     big       R
5      42     big       P
6      12     big       P
7      21  medium       P
8      21   small       R
9     424   small       R
10     34   small       P
11     12  medium       R
12     42   small       R

The operation consists of grouping by the column 'entity' and counting the rows that satisfy two logical conditions, one on the column 'value' and one on the column 'type'. In my case, I have to count the rows where 'value' is greater than 3 and 'type' is not equal to 'medium'. The result must be R=3 and P=4. After this, I must add the result to the original dataframe as a new column named 'Count'. I know this operation can be done in R with the following code:

df[type!='medium' & value>3 , new_var:=.N,by=entity]
df[is.na(new_var),new_var:=0,]
df[,new_var:=max(new_var),by=entity]

In a previous task, the only condition was that the value be greater than 3. In that case, the result was R=4 and P=5, and I got it with the following code:

In []:  df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum())

Out[]:  entity
        P    5
        R    4
        Name: value, dtype: int64

In []:  DF=pd.DataFrame(df.groupby(['entity'])['value'].apply(lambda x: (x>3).sum()))
In []:  DF.reset_index(inplace=True)
In []:  df=df.merge(DF,on=['entity'],how='inner')
In []:  df=df.rename(columns={'value_x':'value','value_y':'count'})
In []:  df
Out[]:  

    value   type     entity  count
0      1     big          R      4
1      2   small          R      4
2      3  medium          R      4
3      2     big          R      4
4     21   small          R      4
5    424   small          R      4
6     12  medium          R      4
7     42   small          R      4
8      4     big          P      5
9     42     big          P      5
10    12     big          P      5
11    21  medium          P      5
12    34   small          P      5
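
The merge and rename steps above can be collapsed. A minimal sketch, assuming the same data as in the question: compute the per-entity counts once, then map them back onto the 'entity' column. The names counts and out are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4, 2, 42, 12, 21, 21, 424, 34, 12, 42],
    'type': ['big', 'small', 'medium', 'big', 'big', 'big', 'big',
             'medium', 'small', 'small', 'small', 'medium', 'small'],
    'entity': ['R', 'R', 'R', 'P', 'R', 'P', 'P', 'P', 'R', 'R', 'P', 'R', 'R'],
})

# One count per entity: the number of rows with value > 3.
counts = df['value'].gt(3).groupby(df['entity']).sum()

# Look up each row's entity in counts; the original row order is kept.
out = df.assign(count=df['entity'].map(counts))
print(out)
```

Because map works on the existing 'entity' column, no suffixes such as value_x and value_y appear and no rename is needed.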

My questions are: how do I do this for the two-condition case? And, more generally, how do I do it with several different conditions?

Create a mask from your conditions: here Series.gt for "greater than" and Series.ne for "not equal", chained with & (element-wise AND). Then use GroupBy.transform with sum to count the True values in each group:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = mask.groupby(df['entity']).transform('sum')
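
To see what these two lines do, it can help to look at the intermediate pieces. The sketch below, which assumes the dataframe from the question, contrasts a plain grouped sum (one value per entity) with transform('sum') (one value per original row). The names per_group and per_row are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4, 2, 42, 12, 21, 21, 424, 34, 12, 42],
    'type': ['big', 'small', 'medium', 'big', 'big', 'big', 'big',
             'medium', 'small', 'small', 'small', 'medium', 'small'],
    'entity': ['R', 'R', 'R', 'P', 'R', 'P', 'P', 'P', 'R', 'R', 'P', 'R', 'R'],
})

# Boolean Series, True where both conditions hold for that row.
mask = df['value'].gt(3) & df['type'].ne('medium')

# Grouped sum: one row per entity, indexed by entity.
per_group = mask.groupby(df['entity']).sum()

# transform('sum'): same length and index as df, so it can be assigned as a column.
per_row = mask.groupby(df['entity']).transform('sum')

print(per_group)
print(per_row.tolist())
```

Summing a boolean Series counts its True values, which is why sum acts as a conditional count here.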

Solution with a helper column new:

mask = df['value'].gt(3) & df['type'].ne('medium')
df['count'] = df.assign(new = mask).groupby('entity')['new'].transform('sum')

print (df)
    value    type entity  count
0       1     big      R      3
1       2   small      R      3
2       3  medium      R      3
3       4     big      P      4
4       2     big      R      3
5      42     big      P      4
6      12     big      P      4
7      21  medium      P      4
8      21   small      R      3
9     424   small      R      3
10     34   small      P      4
11     12  medium      R      3
12     42   small      R      3
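
For the general case with an arbitrary number of conditions, one option is to keep the conditions in a list and combine them with functools.reduce and operator.and_. This is a sketch rather than part of the original answer; the third condition (entity is 'R' or 'P') is a made-up placeholder that only shows where further conditions would go.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({
    'value': [1, 2, 3, 4, 2, 42, 12, 21, 21, 424, 34, 12, 42],
    'type': ['big', 'small', 'medium', 'big', 'big', 'big', 'big',
             'medium', 'small', 'small', 'small', 'medium', 'small'],
    'entity': ['R', 'R', 'R', 'P', 'R', 'P', 'P', 'P', 'R', 'R', 'P', 'R', 'R'],
})

# Each entry is a boolean Series aligned with df; add or remove entries as needed.
conditions = [
    df['value'].gt(3),
    df['type'].ne('medium'),
    df['entity'].isin(['R', 'P']),  # hypothetical extra condition, always True for this data
]

# Fold the list into a single mask with element-wise AND.
mask = reduce(operator.and_, conditions)

# Count the True values per entity and broadcast the count to every row.
df['count'] = mask.groupby(df['entity']).transform('sum')
print(df)
```

Replacing operator.and_ with operator.or_ would count rows that satisfy at least one of the conditions instead of all of them.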

The pandas solution is superb. This is an alternative using a different package. I am adding it here because the original code was written with data.table in R, and it might be useful to others who want a similar solution in Python.

This is a solution in pydatatable, a library that aims to replicate data.table in Python. Note that it is not as feature-rich as pandas; hopefully, more features will be added with time.

Create the frame with datatable:

from datatable import dt, f, by, update

df = dt.Frame({'value':[1,2,3,4,2,42,12,21,21,424,34,12,42],
               'type':['big','small','medium','big','big','big','big','medium','small','small','small','medium','small'],
               'entity':['R','R','R','P','R','P','P','P','R','R','P','R','R']})

Create the condition. In datatable, the f symbol is a shortcut that refers to the frame being operated on:

condition = (f.type!="medium") & (f.value>3)

The syntax below should be familiar to users of data.table:

 DT[i, j, by] 

where i refers to anything that can occur in the rows, j refers to column operations, and by is for grouping operations. The update function is similar to the := operator in data.table; it allows new columns to be created, or existing columns to be updated, in place.

df[:, update(count=dt.sum(condition)), by('entity')]

df

    value    type  entity  count
0       1     big       R      3
1       2   small       R      3
2       3  medium       R      3
3       4     big       P      4
4       2     big       R      3
5      42     big       P      4
6      12     big       P      4
7      21  medium       P      4
8      21   small       R      3
9     424   small       R      3
10     34   small       P      4
11     12  medium       R      3
12     42   small       R      3
