What is the meaning of the output of numpy's histogram function when density is True?

Question

I don't understand the output of numpy's histogram function when density is True. When I do this:

import numpy as np
hist = np.histogram(np.array([1, 1, 2, 3, 4]), 4, density=False)
print("histogram: ", hist)

the output is:

histogram:  (array([2, 1, 1, 1]), array([1.  , 1.75, 2.5 , 3.25, 4.  ]))

That is clear to me: I created 4 intervals, array([1., 1.75, 2.5, 3.25, 4.]), and array([2, 1, 1, 1]) is the number of elements in each interval. But when I do it with density = True:

hist = np.histogram(np.array([1, 1, 2, 3, 4]), 4, density=True)
print("histogram: ", hist)

the result is:

histogram:  (array([0.53333333, 0.26666667, 0.26666667, 0.26666667]), array([1.  , 1.75, 2.5 , 3.25, 4.  ]))

I don't understand what the numbers array([0.53333333, 0.26666667, 0.26666667, 0.26666667]) are. The documentation says that this is a probability density function, but a PDF should sum to 1, so these are not the fraction of elements in each bin. My question is: how are those numbers calculated? Could you explain using my example?

Answer 1

A clue is given in the second half of the documentation paragraph you read:

density: bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

The values you're seeing are the values of the PDF in each bin. They are not required to sum to 1 because they are not probability masses (as stated in the docs); it is the area under the PDF that equals 1.

In your case, you can see that the area under the PDF, which is a function of bin height (density) and bin width, is equal to 1 by summing the product of height and width for each bin:

(0.53*0.75) + (0.27*0.75) + (0.27*0.75) + (0.27*0.75) ≈ 1 (the small discrepancy comes from rounding the densities to two decimals)
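
The same check can be done without rounding. This is a minimal sketch using the data from the question; the variable names (density, edges, widths, area) are chosen here for illustration only:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 4])

# Density values and bin edges, as in the question
density, edges = np.histogram(data, bins=4, density=True)

# Bin widths are the differences between consecutive edges (0.75 each here)
widths = np.diff(edges)

# Area under the histogram: sum of height * width over all bins
area = np.sum(density * widths)

print("sum of densities:", density.sum())  # not 1, because the bin width is not 1
print("area under PDF:  ", area)           # 1, up to floating-point error
```

The plain sum of the densities only equals 1 when every bin has width 1, which is the case the documentation calls out.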

EDIT:

Regarding how those density values are calculated, you can see that in the numpy source:

if density:
    db = np.array(np.diff(bin_edges), float)
    return n/db/n.sum(), bin_edges

where n is the array of histogram counts and db contains the bin widths.

So, in your specific example, your first histogram value is 2, which converts to a density value of 0.53 in the following way:

2 / 0.75 / (2 + 1 + 1 + 1) ≈ 0.53

Each of the remaining bins has a count of 1, so:

1 / 0.75 / (2 + 1 + 1 + 1) ≈ 0.27
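
To confirm this for all bins at once, the normalization can be reproduced by hand and compared with numpy's result. This is a sketch of the same formula (counts / widths / total), not a copy of numpy's internals; the names counts, widths, manual and builtin are illustrative:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 4])

# Raw counts per bin (density=False is the default)
counts, edges = np.histogram(data, bins=4)

# Width of each bin
widths = np.diff(edges)

# Manual normalization: count / bin width / total number of samples
manual = counts / widths / counts.sum()

# numpy's own density output for comparison
builtin, _ = np.histogram(data, bins=4, density=True)

print(manual)
print(np.allclose(manual, builtin))
```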
