
How to use pandas to group items by classes bounded by a difference less than 4

I would like to group items into classes whose x values differ by at most 4, so that 1, 2, 3, 4, 5 are grouped into class 1, 9-13 into class 9, and so on, and then select the min/max values of the attribute y for each class in an efficient and easy way:

items= [('x', [ 1,2,3,3,3,5,9,10,11,13]), ('y', [1,1,1,1,1,4,4,1,1,1])]

In [3]: pd.DataFrame.from_items(items)
Out[3]:
    x  y
0   1  1
1   2  1
2   3  1
3   3  1
4   3  1
5   5  4
6   9  5
7  10  1
8  11  1
9  13  1

So the result I expect would be:

xclass  ymax  ymin
     1     4     1
     9     5     1

I did this by iterating without pandas, but I would like to test the performance with pandas.

Such operations are usually done in two steps:

  1. Create a key to group by.
  2. Calculate aggregate statistics with groupby.

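I assume you have a dataframe df defined as follows (DataFrame.from_items has been removed from recent pandas versions, so the plain constructor is used here):

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 3, 3, 5, 9, 10, 11, 13],
                   'y': [1, 1, 1, 1, 1, 4, 4, 1, 1, 1]})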

The first step is not defined very well in your question. How should the borders between groups be drawn if the data is dense? For example, what would you want to do with the groups if you had df['x'] = [1,2,3,3,5,7,9,10,11,13]?

The simplest idea is to round x down to the precision you want. This ensures that the distance between any two integers in a group does not exceed 4. But the groups will be placed without gaps: with the call below, 1-5 go to 1, 6-10 to 6, 11-15 to 11, etc.

def custom_round(x, precision, offset):
    return ((x-offset) // precision) * precision + offset
df['xclass'] = custom_round(df['x'], 5, 1)
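
To see what the rounding does to the sample data, here is a minimal sketch. It rebuilds the frame with the plain pd.DataFrame constructor, which is an assumption about your pandas version rather than part of the original code:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 3, 3, 5, 9, 10, 11, 13],
                   'y': [1, 1, 1, 1, 1, 4, 4, 1, 1, 1]})

def custom_round(x, precision, offset):
    # Floor x to the nearest lower multiple of `precision`, shifted by `offset`.
    return ((x - offset) // precision) * precision + offset

xclass = custom_round(df['x'], 5, 1)
print(xclass.tolist())  # [1, 1, 1, 1, 1, 1, 6, 6, 11, 11]
```

Note that 9-13 ends up split between classes 6 and 11, so fixed bins do not reproduce the grouping asked for in the question.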

Another idea is to build groups that are dense enough: two groups are merged if the minimal distance between them does not exceed a threshold. Such an algorithm can produce large groups, separated by gaps wider than the threshold. It can be implemented with the DBSCAN clustering algorithm. To get the groups you want, set the threshold distance to 3 (because the distance between 5 and 9 is already 4):

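import numpy as np
from sklearn.cluster import DBSCAN

def cluster(x, threshold):
    # min_samples=1 makes every point a core point, so no point is marked as noise
    labels = DBSCAN(eps=threshold, min_samples=1).fit(np.array(x)[:, np.newaxis]).labels_
    # name each cluster after its smallest x value
    return x.groupby(labels).transform('min')
df['xclass'] = cluster(df['x'], 3)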
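
If you would rather avoid the scikit-learn dependency, the same gap-based idea can be sketched in pandas alone. This is an alternative I am suggesting, not the DBSCAN code above: it assumes x is one-dimensional and already sorted in ascending order, and the helper name cluster_by_gap is hypothetical. Under those assumptions it should produce the same groups as DBSCAN with min_samples=1.

```python
import pandas as pd

def cluster_by_gap(x, threshold):
    # Assumes x is sorted ascending: a new group starts wherever the
    # step from the previous value is larger than `threshold`.
    group_id = (x.diff() > threshold).cumsum()
    # Name each group after its smallest x value.
    return x.groupby(group_id).transform('min')

x = pd.Series([1, 2, 3, 3, 3, 5, 9, 10, 11, 13])
sparse_classes = cluster_by_gap(x, 3)
print(sparse_classes.tolist())  # [1, 1, 1, 1, 1, 1, 9, 9, 9, 9]

# The dense example: no gap exceeds 3, so everything chains into one group.
dense = pd.Series([1, 2, 3, 3, 5, 7, 9, 10, 11, 13])
dense_classes = cluster_by_gap(dense, 3)
print(dense_classes.tolist())  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The second call illustrates the caveat about dense data: the single resulting group spans 1 to 13, far more than a difference of 4.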

The second step is easy: given a dataframe df with columns xclass and y, call:

df.groupby('xclass')['y'].agg(['min', 'max']).reset_index()
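
To get the column names ymax and ymin used in the question, both steps can be put together as in the following sketch. It assumes pandas 0.25 or later for named aggregation, and it builds xclass with a simple diff/cumsum stand-in for step one, so the example runs without scikit-learn:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 3, 3, 5, 9, 10, 11, 13],
                   'y': [1, 1, 1, 1, 1, 4, 4, 1, 1, 1]})

# Step 1 (stand-in, assumes x is sorted): start a new group after any gap > 3,
# then name each group after its smallest x value.
group_id = (df['x'].diff() > 3).cumsum()
df['xclass'] = df['x'].groupby(group_id).transform('min')

# Step 2: named aggregation gives the output columns explicit names.
result = (df.groupby('xclass')['y']
            .agg(ymax='max', ymin='min')
            .reset_index())
print(result)
```

Any other way of filling xclass, such as the rounding or DBSCAN variants, plugs into step 2 unchanged.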
