
How to filter a groupby object in pandas based on the difference of values within the group?

I have a dataframe as listed below:

In []: import numpy as np

In []: import pandas as pd

In []: dff = pd.DataFrame({'A': np.arange(8),
                           'B': list('aabbbbcc'),
                           'C': np.random.randint(100, size=8)})

which I have grouped by column B:

In []: grouped = dff.groupby('B')

Now I want to filter dff based on the difference between values in column 'C'. For example, if the difference between any two points within a group in column C is greater than a threshold, those rows should be removed.

If dff is:

   A  B   C
0  0  a  18
1  1  a  25
2  2  b  56
3  3  b  62
4  4  b  46
5  5  b  56
6  6  c  74
7  7  c   3

Then a threshold of 10 for C should produce a final table like:

   A  B   C
0  0  a  18
1  1  a  25
2  2  b  56
3  3  b  62
4  4  b  46
5  5  b  56

Here group c (lowercase) is removed because the difference between its two values is greater than 10, but group b keeps all of its rows because each one is within 10 of another row in the group.

I think I'd do the hard work in numpy:

In [11]: a = np.array([2, 3, 14, 15, 54])

In [12]: res = np.abs(a[:, np.newaxis] - a) < 10  # Note: perhaps you want <= 10.

In [13]: np.fill_diagonal(res, False)

In [14]: res.any(0)
Out[14]: array([ True,  True,  True,  True, False], dtype=bool)
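To make the broadcasting step easier to follow, here is a minimal sketch on the same array. The names `diff` and `mask` are introduced here for illustration and are not part of the original answer.

```python
import numpy as np

a = np.array([2, 3, 14, 15, 54])

# a[:, np.newaxis] has shape (5, 1); subtracting a (shape (5,)) broadcasts
# to a (5, 5) matrix where diff[i, j] == |a[i] - a[j]|.
diff = np.abs(a[:, np.newaxis] - a)

res = diff < 10
# Every element is trivially within 10 of itself, so clear the diagonal.
np.fill_diagonal(res, False)

# Column j contains a True if element j is close to at least one other element.
mask = res.any(axis=0)

print(diff.shape)
print(mask)
```

Note that the matrix has one entry per pair of elements, so memory grows quadratically with the size of the array.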

You could wrap this in a function:

In [15]: def has_close(a, n=10):
   ....:     # True where an element is within n of some other element of a
   ....:     res = np.abs(a[:, np.newaxis] - a) < n
   ....:     np.fill_diagonal(res, False)
   ....:     return res.any(0)

In [16]: g = dff.groupby('B', as_index=False)

In [17]: g.apply(lambda x: x[has_close(x.C.values)])
Out[17]: 
   A  B   C
0  0  a  18
1  1  a  25
2  2  b  56
3  3  b  62
5  5  b  56
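A self-contained sketch of the same idea on the question's example data follows. It swaps `apply` for `groupby(...).transform` to build a boolean mask aligned with `dff`. This is an alternative, not the original answer's code, and the names `mask`, `result`, `mask_inclusive`, and `result_inclusive` are introduced here for illustration. Passing `n=11` stands in for `<= 10`, which works only because column C holds integers.

```python
import numpy as np
import pandas as pd


def has_close(a, n=10):
    """Return a boolean array: True where a value is within n of another value."""
    a = np.asarray(a)
    res = np.abs(a[:, np.newaxis] - a) < n
    np.fill_diagonal(res, False)  # ignore each value's distance to itself
    return res.any(axis=0)


# Fixed values copied from the question's table (the original used random data).
dff = pd.DataFrame({'A': np.arange(8),
                    'B': list('aabbbbcc'),
                    'C': [18, 25, 56, 62, 46, 56, 74, 3]})

# Strict comparison (< 10), as in the answer: row 4 (C == 46) is dropped
# because its nearest neighbour in group b is exactly 10 away.
mask = dff.groupby('B')['C'].transform(has_close).astype(bool)
result = dff[mask]

# Inclusive comparison (<= 10 for integer data): row 4 is kept.
mask_inclusive = dff.groupby('B')['C'].transform(
    lambda s: has_close(s, n=11)).astype(bool)
result_inclusive = dff[mask_inclusive]

print(result)
print(result_inclusive)
```

With the strict comparison the rows kept are the same as in the output above; with the inclusive one the result matches the table the question asks for, which also keeps row 4.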


 