
An effective way to perform group by with a filter

I need to group a data frame, apply a filter within each group, and I'm not sure how to do that.

Assume there are 3 columns: group, distance and value. group is the column to group by, distance is the column I want to apply the filter to, and value is the column I want to take when the filter returns true.

Here is what I did:

from numpy import around
from numpy.random import uniform
from pandas import DataFrame

data = around(a=uniform(low=1.0, high=50.0, size=(20, 3)), decimals=3)
df = DataFrame(data=data, columns=['group', 'distance', 'value'], dtype='float64')

rows, columns = df.shape
# .loc slices include the end label, so stop at rows // 2 - 1 for the first half
df.loc[:rows // 2 - 1, 'group'] = 1.0
df.loc[rows // 2:, 'group'] = 2.0

print(df)

df.loc[:, 'next_distance'] = df.groupby(by='group')['distance'].shift(periods=-1)
df.loc[:, 'next_value'] = df.groupby(by='group')['value'].shift(periods=-1)
distance_filter = df.loc[:, 'next_distance'] - df.loc[:, 'distance'] > 10.0
df.loc[distance_filter, 'new_value'] = df.loc[distance_filter, 'next_value']

print(df)

The first print of df is:

    group  distance   value
0     1.0     3.757  30.593
1     1.0    14.770  13.313
2     1.0    12.594  38.865
3     1.0    47.806  36.357
4     1.0     7.930  28.235
5     1.0     6.133  42.323
6     1.0    23.422   4.883
7     1.0    12.706   1.606
8     1.0    29.787  48.096
9     1.0    41.889  24.148
10    2.0    15.712  28.568
11    2.0    38.143  20.496
12    2.0    24.282   9.562
13    2.0    25.148  26.535
14    2.0    44.163  42.303
15    2.0    38.116  17.947
16    2.0     4.716  17.259
17    2.0    11.980   4.369
18    2.0    35.533  20.866
19    2.0    11.921  47.971

The second print of df is:

    group  distance   value  next_distance  next_value  new_value
0     1.0     3.757  30.593         14.770      13.313     30.593
1     1.0    14.770  13.313         12.594      38.865        NaN
2     1.0    12.594  38.865         47.806      36.357     38.865
3     1.0    47.806  36.357          7.930      28.235        NaN
4     1.0     7.930  28.235          6.133      42.323        NaN
5     1.0     6.133  42.323         23.422       4.883     42.323
6     1.0    23.422   4.883         12.706       1.606        NaN
7     1.0    12.706   1.606         29.787      48.096      1.606
8     1.0    29.787  48.096         41.889      24.148     48.096
9     1.0    41.889  24.148            NaN         NaN        NaN
10    2.0    15.712  28.568         38.143      20.496     28.568
11    2.0    38.143  20.496         24.282       9.562        NaN
12    2.0    24.282   9.562         25.148      26.535        NaN
13    2.0    25.148  26.535         44.163      42.303     26.535
14    2.0    44.163  42.303         38.116      17.947        NaN
15    2.0    38.116  17.947          4.716      17.259        NaN
16    2.0     4.716  17.259         11.980       4.369        NaN
17    2.0    11.980   4.369         35.533      20.866      4.369
18    2.0    35.533  20.866         11.921      47.971        NaN
19    2.0    11.921  47.971            NaN         NaN        NaN

All I need is the new_value column. Is there a better way to do it?

You can use groupby with both columns at once and then compute the difference df1['distance'] - df['distance']:

df1 = df.groupby(by='group')[['distance','value']].shift(periods=-1)
distance_filter = df1['distance'] - df['distance'] > 10.0
df.loc[distance_filter, 'new_value'] = df1.loc[distance_filter, 'value']

print(df)
    group  distance   value  new_value
0     1.0    26.097  16.973     16.973
1     1.0    36.866  28.804        NaN
2     1.0    28.644  17.779        NaN
3     1.0    19.339  44.409        NaN
4     1.0     5.768  28.003     28.003
5     1.0    40.646   3.632        NaN
6     1.0    20.141   8.516        NaN
7     1.0    17.949  46.639        NaN
8     1.0    23.825  45.374        NaN
9     1.0    11.013  33.044        NaN
10    2.0    42.859  39.162        NaN
11    2.0    45.025  17.099        NaN
12    2.0     7.124  19.366     19.366
13    2.0    22.728  23.045     23.045
14    2.0    34.603  46.527     46.527
15    2.0    45.901  40.602        NaN
16    2.0    20.585  11.294        NaN
17    2.0    27.979  24.360        NaN
18    2.0    15.374   5.726      5.726
19    2.0    27.611  17.011        NaN
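
As a side note, the assignment through df.loc could probably also be written with Series.where, which keeps the shifted value where the filter holds and puts NaN everywhere else. The sketch below follows the code as posted (it takes the next row's value) and uses a small hand-written frame with made-up numbers so the result can be checked by eye; the name nxt is only illustrative.

```python
import pandas as pd

# Small hand-written frame (hypothetical values) so the result is easy to verify.
df = pd.DataFrame({
    "group":    [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    "distance": [1.0, 15.0, 20.0, 30.0, 5.0, 40.0],
    "value":    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Next row's distance and value within each group;
# the last row of every group has no successor and gets NaN.
nxt = df.groupby("group")[["distance", "value"]].shift(periods=-1)

# True where the next distance exceeds the current one by more than 10.
# Comparisons against NaN are False, so group-final rows never pass.
distance_filter = (nxt["distance"] - df["distance"]) > 10.0

# Keep the next value where the filter holds, NaN elsewhere.
df["new_value"] = nxt["value"].where(distance_filter)

print(df)
# new_value is 20.0 for row 0, 60.0 for row 4, and NaN for the other rows
```

This creates new_value in one step and does not rely on the column being absent beforehand.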

If you need the same output as in the question, only a small change is required:

df=df.join(df.groupby('group')[['distance','value']].shift(periods=-1).add_prefix('next_'))
distance_filter = df['next_distance'] - df['distance'] > 10.0
df.loc[distance_filter, 'new_value'] = df.loc[distance_filter, 'next_value']

print(df)
    group  distance   value  next_distance  next_value  new_value
0     1.0    12.253  29.438         28.814      38.660     29.438
1     1.0    28.814  38.660         20.756      24.588        NaN
2     1.0    20.756  24.588         16.776      11.183        NaN
3     1.0    16.776  11.183          7.214      47.655        NaN
4     1.0     7.214  47.655         17.083      17.805        NaN
5     1.0    17.083  17.805         24.074       4.120        NaN
6     1.0    24.074   4.120         40.108      48.605      4.120
7     1.0    40.108  48.605         40.571       1.591        NaN
8     1.0    40.571   1.591         30.987      36.448        NaN
9     1.0    30.987  36.448            NaN         NaN        NaN
10    2.0    37.585  13.128          9.864      18.969        NaN
11    2.0     9.864  18.969         46.241      39.490     18.969
12    2.0    46.241  39.490         40.612       7.873        NaN
13    2.0    40.612   7.873         39.053      16.816        NaN
14    2.0    39.053  16.816         13.665      32.730        NaN
15    2.0    13.665  32.730         35.349      43.783     32.730
16    2.0    35.349  43.783         11.412      19.120        NaN
17    2.0    11.412  19.120         40.855      41.502     19.120
18    2.0    40.855  41.502         16.973      40.430        NaN
19    2.0    16.973  40.430            NaN         NaN        NaN
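
Since the question says only new_value is needed, the helper columns could also be skipped entirely. This is a minimal sketch, assuming the column names from the question; the function name next_value_if_jump and its threshold parameter are hypothetical, and the sample numbers are made up.

```python
import pandas as pd


def next_value_if_jump(df, threshold=10.0):
    """Return the next row's value within each group where the distance
    to the next row grows by more than `threshold`, NaN elsewhere."""
    grouped = df.groupby("group")
    # diff(periods=-1) is current minus next, so negate it to get next minus current.
    jump = -grouped["distance"].diff(periods=-1)
    next_value = grouped["value"].shift(periods=-1)
    return next_value.where(jump > threshold)


# Hypothetical sample data: two groups of two rows each.
df = pd.DataFrame({
    "group":    [1.0, 1.0, 2.0, 2.0],
    "distance": [0.0, 11.0, 50.0, 55.0],
    "value":    [1.0, 2.0, 3.0, 4.0],
})

df["new_value"] = next_value_if_jump(df)
print(df)
# Only row 0 passes the filter (11 - 0 > 10), so new_value is 2.0 there and NaN elsewhere.
```

The frame keeps its original three columns plus new_value, with no next_distance or next_value left behind.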

EDIT:

df1 = df[['group']].join(df.groupby(by='group')[['distance','value']].shift(periods=-1))
print(df1)
    group  distance   value
0     1.0    44.142  10.032
1     1.0    14.315  30.959
2     1.0    31.881  44.687
3     1.0    25.850   2.651
4     1.0    40.928   9.444
5     1.0     2.230  18.175
6     1.0    22.793  21.242
7     1.0     2.378  19.381
8     1.0    10.907  29.599
9     1.0       NaN     NaN
10    2.0    32.876  24.147
11    2.0    38.133  41.621
12    2.0    39.026  39.042
13    2.0    19.474   5.325
14    2.0    31.824   6.052
15    2.0    46.525  49.705
16    2.0    17.858  48.050
17    2.0    14.817   9.273
18    2.0    24.547  16.233
19    2.0       NaN     NaN
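
The NaN values in rows 9 and 19 appear because the shift is done per group, so the last row of each group has no following row to take values from. A small comparison with a plain shift, on made-up numbers, illustrates the difference:

```python
import pandas as pd

# Hypothetical data: two groups of two rows each.
df = pd.DataFrame({
    "group":    [1.0, 1.0, 2.0, 2.0],
    "distance": [3.0, 8.0, 20.0, 25.0],
})

# Plain shift ignores the groups: row 1 receives the first distance of group 2.
plain = df["distance"].shift(periods=-1)

# Grouped shift stops at the group boundary: row 1 receives NaN instead.
grouped = df.groupby("group")["distance"].shift(periods=-1)

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
# plain:   8.0, 20.0, 25.0, NaN
# grouped: 8.0,  NaN, 25.0, NaN
```

With the plain shift, the filter would compare the last row of one group against the first row of the next, which is what the groupby avoids.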


 