
Variable bins for each row in pandas dataframe

Given a coordinate dataframe such as:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': np.tile(np.arange(20), 5), 'y': np.repeat(np.arange(5), 20)})

I would like to bin each x value; however, the number of bins varies from row to row. More specifically, the number of bins depends on the y value.

E.g. for the point x=6, y=2: if the number of bins is y+1 = 3, then the bins for this row are (0, 6.33], (6.33, 12.67], (12.67, 19], and the resulting bin is (0, 6.33].
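To make the arithmetic concrete, here is a minimal sketch of that single row, assuming the bins span the global range of x (0 to 19). The names `x_min`, `x_max`, `n_bins` and `edges` are illustrative only. Note that n bins need n + 1 edges.

```python
import numpy as np
import pandas as pd

x_min, x_max = 0, 19          # global range of x in df1
y = 2
n_bins = y + 1                # 3 bins for this row

# n_bins intervals are delimited by n_bins + 1 evenly spaced edges
edges = np.linspace(x_min, x_max, n_bins + 1)

# include_lowest=True so that a value equal to x_min is not left unbinned
binned = pd.cut(pd.Series([6]), bins=edges, include_lowest=True)
print(edges)
print(binned.iloc[0])
```

The point x=6 falls in the first of the three intervals, whose right edge is 19/3 ≈ 6.33.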

Part of the resulting dataframe would look like:

x    y    xbinned
18   2    (12.67, 19]
19   2    (12.67, 19]
0    3    (0, 4.75]
1    3    (0, 4.75]

The following generates the desired bin edges:

xbins = []

for y in df1.y:
    # y + 1 bins require y + 2 edges
    xbins.append(np.linspace(df1['x'].min(), df1['x'].max(), y + 2))

But they cannot be passed to pd.cut:

df1['xbinned'] = pd.cut(df1.x, bins=xbins)

since it expects a single 1-D array of bin edges, not a separate array for each row.

Where do I go from here? I think I would be able to do this using loops, but was hoping to use the pandas functions for a more vectorised solution.

If I understand correctly, you can cut x within each y group:

# .iloc[0] reads the group's y value by position; the label 0 exists only in the first group
df1['xbinned'] = (df1.groupby('y')
                     .apply(lambda d: pd.cut(d['x'], bins=d['y'].iloc[0]+1))
                     .reset_index(level=0, drop=True)
                 )

Output (partial)

     x  y         xbinned
18  18  0  (-0.019, 19.0]
19  19  0  (-0.019, 19.0]
38  18  1     (9.5, 19.0]
39  19  1     (9.5, 19.0]
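Note that when `bins` is an integer, pd.cut derives the edges from each group's own x range and widens that range by 0.1%, which is why the first interval above starts at -0.019 rather than 0. If the edges described in the question are wanted instead, one possible sketch is to build them explicitly for each y group. This assumes y + 1 bins per row spanning the global minimum and maximum of x; `lo`, `hi` and `pieces` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': np.tile(np.arange(20), 5),
                    'y': np.repeat(np.arange(5), 20)})

lo, hi = df1['x'].min(), df1['x'].max()

pieces = []
for y, xs in df1.groupby('y')['x']:
    # y + 1 bins need y + 2 evenly spaced edges over the global x range
    edges = np.linspace(lo, hi, y + 2)
    # intervals are right-closed, so include_lowest=True keeps x == lo in the first bin
    pieces.append(pd.cut(xs, bins=edges, include_lowest=True))

# each piece keeps the original row index, so the assignment aligns row by row
df1['xbinned'] = pd.concat(pieces)
print(df1[df1['y'] == 2].tail(2))
```

This still loops, but only once per distinct y value (5 iterations here) rather than once per row, and each pd.cut call is vectorised over its group. Because the groups have different categories, the combined column holds plain Interval objects rather than a single categorical.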


 