
Python: Binning one coordinate and averaging another based on these bins

I have two vectors, rev_count and stars. Their elements form pairs (let's say rev_count is the x coordinate and stars is the y coordinate).

I would like to bin the data by rev_count and then average the stars in a single rev_count bin (I want to bin along the x axis and compute the average y coordinate in that bin).

This is the code that I tried to use (inspired by my MATLAB background):

import matplotlib.pyplot as plt
import numpy

binwidth = numpy.max(rev_count)/10
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)

for i in range(0, len(revbin)-1):
    revbinnedstars[i] = numpy.mean(stars[numpy.argwhere((revbin[i]-binwidth/2) < rev_count < (revbin[i]+binwidth/2))])

print('Plotting binned stars with count')
plt.figure(3)
plt.plot(revbin, revbinnedstars, '.')
plt.show()

However, this seems to be incredibly slow and inefficient. Is there a more natural way to do this in Python?

SciPy has a function for this, binned_statistic:

from scipy.stats import binned_statistic

revbinnedstars, edges, _ = binned_statistic(rev_count, stars, 'mean', bins=10)
revbin = edges[:-1]
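
As a minimal sketch of the call above, assuming synthetic data in place of the real rev_count and stars (the arrays and the seed below are made up for illustration), the bin centers can be derived from the returned edges, which is often what you want on the x axis of a plot:

```python
import numpy as np
from scipy.stats import binned_statistic

# Synthetic stand-ins for the real data (assumed for illustration only).
rng = np.random.default_rng(0)
rev_count = rng.integers(0, 1000, size=5000)
stars = rng.uniform(1, 5, size=5000)

# Mean of stars within each of 10 equal-width bins of rev_count.
revbinnedstars, edges, binnumber = binned_statistic(
    rev_count, stars, statistic='mean', bins=10)

# edges has one more entry than there are bins; average neighbours for centers.
centers = 0.5 * (edges[:-1] + edges[1:])
```

Using edges[:-1], as in the answer, plots each mean at the left edge of its bin; centers puts it in the middle instead.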

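If you don't want to use SciPy, numpy's histogram function can do the same: a histogram weighted by stars gives the per-bin sums, and an unweighted one gives the per-bin counts: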

sums, edges = numpy.histogram(rev_count, bins=10, weights=stars)
counts, _ = numpy.histogram(rev_count, bins=10)
revbinnedstars = sums / counts
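
One caveat with the numpy version: a bin that contains no points has a count of zero, so sums / counts yields nan together with a runtime warning. A hedged sketch of one way to handle that, using made-up data with a deliberate gap, is to divide only where the count is positive:

```python
import numpy as np

# Made-up data with a gap between 3 and 9, so some bins stay empty.
rev_count = np.array([0, 1, 1, 2, 3, 9, 10])
stars = np.array([4.0, 3.0, 5.0, 2.0, 1.0, 5.0, 3.0])

sums, edges = np.histogram(rev_count, bins=5, weights=stars)
counts, _ = np.histogram(rev_count, bins=5)

# Pre-fill with nan and divide only in bins that contain at least one point.
revbinnedstars = np.full(len(counts), np.nan)
np.divide(sums, counts, out=revbinnedstars, where=counts > 0)
```

Empty bins stay nan, which matplotlib simply skips when plotting.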

I suppose you are using Python 2. If not, change the division used to calculate the step to // (floor division); otherwise range will raise a TypeError, because it cannot interpret a float as its step.

binwidth = numpy.max(rev_count)//10 # Changed this to floor division
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)

for i in range(0, len(revbin)-1):
    # I actually don't know what you wanted to do but I guess you wanted the
    # "logical and" combination in that bin (you don't need to use np.where here)
    # You can put that all in one statement but it gets crowded so I'll split it:
    index1 = revbin[i] - binwidth/2 < rev_count
    index2 = rev_count < revbin[i] + binwidth/2
    revbinnedstars[i] = numpy.mean(stars[numpy.logical_and(index1, index2)])

That at least should work and gives the right results. It will be very inefficient if you have huge datasets and want more than 10 bins.

One very important takeaway:

  • Don't use np.argwhere if you want to index an array. Its result is only meant to be human readable. If you really want the coordinates, use np.where. Its result can be used as an index, but it is not that pretty to read if you have multidimensional inputs.

The numpy documentation supports me on that point:

The output of argwhere is not suitable for indexing arrays. For this purpose use where(a) instead.

That is also why your code was so slow: it tried to do something you did not intend, which can be very expensive in memory and CPU usage, and it still did not give you the right result.

What I have used here are called boolean masks. A mask is shorter to write than np.where(condition) and involves one less calculation.
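
To make the difference concrete, here is a small sketch with made-up arrays comparing the three ways of selecting elements. In one dimension the values selected through np.argwhere happen to match, but the result has an extra dimension; with multidimensional inputs it is no longer the intended selection at all:

```python
import numpy as np

# Made-up data for illustration.
stars = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rev_count = np.array([5, 15, 25, 35, 45])

# Boolean mask with the same length as the data.
mask = (rev_count > 10) & (rev_count < 40)

by_mask = stars[mask]               # direct boolean indexing
by_where = stars[np.where(mask)]    # tuple of index arrays, also valid
coords = np.argwhere(mask)          # shape (3, 1): one row per match

# Indexing with the 2-D argwhere result gives a 2-D array, not a flat selection.
by_argwhere = stars[coords]
```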


A completely vectorized approach is possible by defining a grid that knows which stars are in which bin:

import numpy as np

bins = 10
binwidth = np.max(rev_count)//bins
revbin = np.arange(0, np.max(rev_count)+binwidth+1, binwidth)

An even better way to define the bins is np.linspace. Note that you have to add one to the maximum, because you want to include it, and one to the number of bins, because you need the start and end points of the bins, not their centers:

number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count)+1, number_of_bins+1)

Then you can set up the grid:

grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None], rev_count[None, :] < revbin[1:, None])

The grid has shape bins x rev_count (thanks to broadcasting: I added one dimension to each of those arrays, but not the same one). It checks whether each point is greater than or equal to the lower edge of a bin and smaller than its upper edge (hence the [:-1] and [1:] indices). The check is done in two dimensions, with the counts along the second dimension (numpy axis=1) and the bins along the first dimension (numpy axis=0).

So we can get the Y coordinates of the stars in the appropriate bin by just multiplying these with this grid:

stars * grid

To calculate the mean, we sum the coordinates in each bin and divide by the number of stars in that bin (the points run along axis=1, and stars that are not in a bin contribute a value of zero along this axis):

revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)

I actually don't know if that's more efficient. It'll be a lot more expensive in memory but maybe a bit less expensive in CPU.
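
Put together, the grid approach can be sketched end to end as follows. The tiny arrays are made up so the result can be checked by hand; note that both grid and stars * grid hold bins x points elements, which is where the memory cost comes from:

```python
import numpy as np

# Made-up data: ten points with x values 0..9.
rev_count = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
stars = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 4.0, 3.0, 2.0, 1.0])

number_of_bins = 5
# Edges 0, 2, 4, 6, 8, 10; the +1 keeps the maximum inside the last bin.
revbin = np.linspace(np.min(rev_count), np.max(rev_count) + 1, number_of_bins + 1)

# grid[i, j] is True when point j falls into bin i; shape (bins, points).
grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None],
                      rev_count[None, :] < revbin[1:, None])

# Sum of stars per bin divided by the number of points per bin.
revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)
```

As with the histogram version, a bin without any points would divide by zero here and produce nan.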

The function I use for binning (x, y) data and determining summary statistics, such as mean values in those bins, is based on the scipy.stats.binned_statistic() function. I have written a wrapper for it because I use it a lot. You may find this useful:

from scipy import stats

def binXY(x,y,statistic='mean',xbins=10,xrange=None):
    """
    Finds statistical value of x and y values in each x bin. 
    Returns the same type of statistic for both x and y.
    See scipy.stats.binned_statistic() for options.
    
    Parameters
    ----------
    x : array
        x values.
    y : array
        y values.
    statistic : string or callable, optional
        See documentation for scipy.stats.binned_statistic(). Default is mean.
    xbins : int or sequence of scalars, optional
        If xbins is an integer, it is the number of equal bins within xrange.
        If xbins is an array, then it is the location of xbin edges, similar
        to definitions used by np.histogram. Default is 10 bins.
        All but the last (righthand-most) bin is half-open. In other words, if 
        bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but 
        excluding 2) and the second [2, 3). The last bin, however, is [3, 4], 
        which includes 4.    
        
    xrange : (float, float) or [(float, float)], optional
        The lower and upper range of the bins. If not provided, range is 
        simply (x.min(), x.max()). Values outside the range are ignored.
    
    Returns
    -------
    x_stat : array
        The x statistic (e.g. mean) in each bin. 
    y_stat : array
        The y statistic (e.g. mean) in each bin.       
    n : array of dtype int
        The count of y values in each bin.
    """
    x_stat, xbin_edges, binnumber = stats.binned_statistic(x, x, 
                                 statistic=statistic, bins=xbins, range=xrange)
    
    y_stat, xbin_edges, binnumber = stats.binned_statistic(x, y, 
                                 statistic=statistic, bins=xbins, range=xrange)
    
    n, xbin_edges, binnumber = stats.binned_statistic(x, y, 
                                 statistic='count', bins=xbins, range=xrange)
            
    return x_stat, y_stat, n
