How to use pandas to group items by classes bounded by a difference less than 4

Question

I wonder how to create classes of items grouped so that the difference within a class is <= 4, so that 1, 2, 3, 4, 5 are grouped into class 1, 9-13 into class 9, and so on, and then select the min/max values of the attribute y in an efficient and easy way:

items= [('x', [ 1,2,3,3,3,5,9,10,11,13]), ('y', [1,1,1,1,1,4,4,1,1,1])]

In [3]: pd.DataFrame.from_items(items)
Out[3]:
    x  y
0   1  1
1   2  1
2   3  1
3   3  1
4   3  1
5   5  4
6   9  5
7  10  1
8  11  1
9  13  1

So the result I expect would be:

xclass  ymax  ymin
     1     4     1
     9     5     1

I did it by iterating without pandas, but I would like to test the performance with pandas.

Answer 1

Such operations are usually done in two steps:

1. Create a key to group by.
2. Calculate aggregate statistics with groupby.

I assume you have a dataframe df defined as

import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; a dict gives the same frame
df = pd.DataFrame({'x': [1,2,3,3,3,5,9,10,11,13],
                   'y': [1,1,1,1,1,4,4,1,1,1]})

The first step is not well defined in your question. Where should the borders between groups be drawn if the data is dense? For example, what would you want to do with the groups if you had df['x'] = [1,2,3,3,5,7,9,10,11,13]?

The simplest idea is to round x down to the precision you want. This ensures that the distance between any two integers in a group does not exceed 4. But the groups are placed without gaps: 1-5 map to 1, 6-10 to 6, 11-15 to 11, etc.

def custom_round(x, precision, offset):
    return ((x-offset) // precision) * precision + offset
df['xclass'] = custom_round(df['x'], 5, 1)
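As a quick check of what this rounding does to the sample data, here is a minimal sketch that applies custom_round to the x values from the question. The Series name x and the print call are only for illustration.

```python
import pandas as pd

def custom_round(x, precision, offset):
    # Floor x onto a grid of width `precision` that starts at `offset`.
    return ((x - offset) // precision) * precision + offset

# x values taken from the question
x = pd.Series([1, 2, 3, 3, 3, 5, 9, 10, 11, 13])
xclass = custom_round(x, 5, 1)
print(xclass.tolist())  # [1, 1, 1, 1, 1, 1, 6, 6, 11, 11]
```

Note that 9 and 10 land in class 6 while 11 and 13 land in class 11, so fixed-width rounding does not reproduce the 9-13 class asked for in the question.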

Another idea is to have groups that are dense enough: two groups are merged if the minimal distance between them is no more than a threshold. Such an algorithm can produce large groups separated by gaps wider than the threshold. It can be implemented with the DBSCAN clustering algorithm. To get the groups you want, set the threshold distance to 3 (because the distance between 5 and 9 is already 4):

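import numpy as np
from sklearn.cluster import DBSCAN

def cluster(x, threshold):
    # min_samples=1 makes every point a core point, so nothing is labelled as noise
    labels = DBSCAN(eps=threshold, min_samples=1).fit(np.asarray(x)[:, np.newaxis]).labels_
    # label each cluster by its smallest x
    return x.groupby(labels).transform('min')
df['xclass'] = cluster(df['x'], 3)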
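If scikit-learn is not available, the same kind of gap-based split can be sketched with pandas alone. This is a swapped-in technique, not the DBSCAN call above: it sorts the values and starts a new class wherever the gap to the previous value exceeds the threshold. For one-dimensional data with min_samples=1 it should produce the same classes, but treat that as an assumption to verify on your own data. The function name gap_classes is hypothetical.

```python
import pandas as pd

def gap_classes(x, threshold):
    # Assumption: a new class starts wherever the gap to the previous
    # sorted value is strictly greater than `threshold`.
    s = x.sort_values()
    group_id = (s.diff() > threshold).cumsum()
    # Label each class by its smallest member, then restore the original row order.
    return s.groupby(group_id).transform('min').reindex(x.index)

# x values from the question: the only gap wider than 3 is between 5 and 9
x = pd.Series([1, 2, 3, 3, 3, 5, 9, 10, 11, 13])
xclass = gap_classes(x, 3)
print(xclass.tolist())  # [1, 1, 1, 1, 1, 1, 9, 9, 9, 9]

# The dense example from the answer: no gap exceeds 3, so everything chains together
dense = pd.Series([1, 2, 3, 3, 5, 7, 9, 10, 11, 13])
dense_class = gap_classes(dense, 3)
print(dense_class.nunique())  # 1
```

The dense case illustrates the trade-off mentioned above: chaining by gaps can yield one class whose total span (here 1 to 13) is far larger than the threshold.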

The second step is easy: given a dataframe df with columns xclass and y, call:

df.groupby('xclass')['y'].agg(['min', 'max']).reset_index()
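To get the column names from the question (ymax, ymin) directly, named aggregation is one option. The sketch below assumes pandas 0.25 or later and hardcodes the xclass column with the classes 1 and 9 expected in the question, so that it runs on its own. It uses the y values from the list in the answer; the frame printed in the question shows y = 5 at x = 9, which is why the question expects ymax 5 for class 9.

```python
import pandas as pd

# Sample frame; xclass is filled in by hand here for illustration
df = pd.DataFrame({
    'x': [1, 2, 3, 3, 3, 5, 9, 10, 11, 13],
    'y': [1, 1, 1, 1, 1, 4, 4, 1, 1, 1],
    'xclass': [1, 1, 1, 1, 1, 1, 9, 9, 9, 9],
})

# Keyword arguments name the output columns; values name the aggregation
result = df.groupby('xclass')['y'].agg(ymax='max', ymin='min').reset_index()
print(result)
```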

solution1
1 ACCEPTED 2017-10-27 10:41:25