
Percentage of array between values

I'm searching for an easy way to find what percentage of the data falls within certain intervals using Python.

Consider an array X of float values. I'd like to do something similar to quantiles:

X.quantile(np.linspace(0,1,11))

But instead, I'd like to know what percentage of values lie between -10 and 10, for example.

X.method([-10,10])

I know I can do that with scipy.stats.percentileofscore by doing

percentileofscore(X,10) - percentileofscore(X,-10)

I was wondering whether there's a simpler, implemented solution so I could do instead

X.method([a,b,c])

This would give me the percentage of values between min(X) and a, a and b, b and c, and finally between c and max(X).

A simple solution is to use np.histogram :

import numpy as np
X = np.arange(20)
values = [5, 13]  # these are your a and b
freq = np.histogram(X, bins=[-np.inf] + values + [np.inf])[0]/X.size
print(freq)
# prints: [0.25 0.4  0.35]
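The same idea extends to the X.method([a, b, c]) form from the question. Here is a minimal sketch, assuming a hypothetical helper named interval_fractions (it is not a NumPy function). Note that np.histogram treats every bin as half-open, [left, right), except the last one, which also includes its right edge:

```python
import numpy as np

def interval_fractions(x, cuts):
    """Fraction of x falling in each interval delimited by the cut points.

    Hypothetical helper for illustration only; not part of NumPy.
    """
    x = np.asarray(x)
    # Pad the sorted cut points with -inf/+inf so the outer intervals are covered.
    edges = np.concatenate(([-np.inf], np.sort(cuts), [np.inf]))
    counts, _ = np.histogram(x, bins=edges)
    return counts / x.size

X = np.arange(20)
fractions = interval_fractions(X, [5, 10, 13])
# One entry per interval: (-inf, 5), [5, 10), [10, 13), [13, inf]
print(fractions * 100)
```

Multiplying by 100 turns the fractions into percentages; they always sum to 100 because the padded edges cover the whole real line.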

Basic Numpy and Pandas solutions

There's no completely prepackaged method (in Numpy), but there are plenty of one-liners. Here's how to do it using comparison and logical ops (Edit: tip of the hat to Paul Panzer for suggesting the use of np.count_nonzero):

import numpy as np

arr = np.linspace(-15,15,1000)
np.count_nonzero((arr > -10) & (arr < 10))/arr.size

Output:

0.666

If you're willing to use Pandas, the pandas.Series.between method gets you a little closer to the complete package you want:

import pandas as pd

sr = pd.Series(np.linspace(-15,15,1000))
np.count_nonzero(sr.between(-10,10))/sr.size

Output:

0.666

Pitfalls

Every interval analysis method involves an explicit or implicit definition of the interval that you're considering. Is the interval closed (i.e. inclusive of the extreme values) on both ends, like [-10, 10]? Or is it half-open (i.e. it excludes the extreme value on one end), like [-10, 10)? And so forth.

This tends not to be an issue when dealing with arrays of float values taken from data (since it's unlikely any of the data falls exactly on the extremes), but can cause serious problems when working with arrays of int . For example, the two methods I listed above can give different results if the array includes the boundary values of the interval:

arr = np.arange(-15,16)
print(np.count_nonzero((arr > -10) & (arr < 10))/arr.size)
print(np.count_nonzero(pd.Series(arr).between(-10,10))/arr.size)

Output:

0.6129032258064516
0.6774193548387096

The pd.Series.between method defaults to a closed interval on both ends, so to match it in Numpy you'd have to use the inclusive comparison operators:

arr = np.arange(-15,16)
print(np.count_nonzero((arr >= -10) & (arr <= 10))/arr.size)
print(np.count_nonzero(pd.Series(arr).between(-10,10))/arr.size)

Output:

0.6774193548387096
0.6774193548387096

All of this to say: when you pick a method for this kind of interval analysis, be aware of its boundary conventions, and use a consistent convention across all your related analyses.
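On the Pandas side, the boundary convention can also be chosen explicitly. The sketch below assumes pandas 1.3 or later, where the inclusive argument of Series.between accepts the strings "both", "neither", "left" and "right" (older versions took a boolean instead). Taking the mean of the boolean result is just a shorthand for count_nonzero divided by size:

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(-15, 16))  # 31 integers, boundaries -10 and 10 included

# Fraction of values in the interval under each boundary convention.
fractions = {
    inclusive: sr.between(-10, 10, inclusive=inclusive).mean()
    for inclusive in ("both", "neither", "left", "right")
}

for inclusive, frac in fractions.items():
    print(inclusive, frac)
```

With this integer array, "both" counts 21 of the 31 values, "neither" counts 19, and "left" and "right" count 20 each.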

Other solutions

If you assume the data is sorted (or if you sort it yourself), you can use np.searchsorted :

arr = np.random.uniform(-15,15,100)
arr.sort()
np.diff(arr.searchsorted([-10, 10]))[0]/arr.size

Output (varies from run to run, since the data is random and unseeded):

0.65
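The searchsorted approach also handles several cut points at once, which is closer to the X.method([a, b, c]) form asked for. A rough sketch, with the seed and the cut points chosen arbitrarily for illustration; the side argument decides which interval a value exactly equal to a cut point belongs to:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
arr = np.sort(rng.uniform(-15, 15, 100))

cuts = [-10, 0, 10]  # illustrative cut points, must be sorted

# side="left" returns the number of elements strictly below each cut, so a value
# equal to a cut lands in the interval to its right: [cut_i, cut_{i+1}).
# side="right" would put it in the interval to its left instead.
idx = arr.searchsorted(cuts, side="left")

# Pad with 0 and arr.size so the two outer intervals are counted too.
counts = np.diff(np.concatenate(([0], idx, [arr.size])))
fractions = counts / arr.size
print(fractions)
```

The result has one entry per interval (below -10, -10 to 0, 0 to 10, above 10), and the entries sum to 1.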

Setup

a = np.linspace(-15,15,1000)

No built-in method exists, but it is quite simple to define your own using np.count_nonzero and size. In general:

c = (a > -10) & (a < 10)
np.count_nonzero(c) / a.size

You can wrap this in a function for convenience and to allow for cases where you want closed intervals:

def percent_between(a, lower, upper, closed_left=False, closed_right=False):
    """
    Finds the fraction of values in a numpy array that lie between two bounds

    Parameters
    ----------
    a: np.ndarray
      numpy array to calculate percentage
    lower: int, float
      lower bound
    upper: int, float
      upper bound
    closed_left: bool
      if True, include the lower bound ( >= instead of > )
    closed_right: bool
      if True, include the upper bound ( <= instead of < )
    """
    l = np.greater if not closed_left else np.greater_equal
    r = np.less if not closed_right else np.less_equal

    c = l(a, lower) & r(a, upper)
    return np.count_nonzero(c) / a.size

percent_between(a, -10, 10)

0.666

Just to let you guys know, I found a very simple solution to this using value_counts and np.inf:

import pandas as pd
import numpy as np

values = pd.Series(np.linspace(0, 100, 200))
values.value_counts(normalize=True, sort=False, bins=[-np.inf, 10, 20, np.inf])

normalize=True returns relative frequencies (fractions that sum to 1); setting it to False gives the counts.

sort=False returns the results in the order of the bins; setting it to True sorts them by count in descending order.

bins defines the interval edges.

This returns

(-inf, 10.0]    0.1
(10.0, 20.0]    0.1
(20.0, inf]     0.8
dtype: float64
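As the output shows, value_counts with bins produces right-closed intervals. If left-closed intervals are needed instead, one option (a sketch, not the only way) is to bin with pd.cut and right=False first and then count the resulting categories. The small data set here is made up so that some values sit exactly on the edges:

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 10, 15, 20, 30])  # illustrative data with values on the edges
bins = [-np.inf, 10, 20, np.inf]

# Right-closed intervals, as in the value_counts(bins=...) call: (a, b]
right_closed = values.value_counts(normalize=True, sort=False, bins=bins)

# Left-closed intervals: [a, b), via pd.cut with right=False
left_closed = pd.cut(values, bins=bins, right=False).value_counts(
    normalize=True, sort=False
)

print(right_closed)
print(left_closed)
```

The values 10 and 20 move to the next interval up when the convention switches from right-closed to left-closed, which changes the fractions even though the bin edges are identical.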
