
Why does scipy.stats.nanmean give different result from numpy.nansum?

>>> import numpy as np
>>> from scipy import stats
>>> a = np.r_[1., 2., np.nan, 4., 5.]
>>> stats.nanmean(a)
2.9999999999999996
>>> np.nansum(a)/np.sum(~np.isnan(a))
3.0

I'm aware of the limitations of floating-point representation. I'm just curious why the clumsier expression seems to give a "better" result.

First of all, here is scipy.nanmean() so that we know what we're comparing to:

def nanmean(x, axis=0):
    x, axis = _chk_asarray(x, axis)
    x = x.copy()
    Norig = x.shape[axis]
    # fraction of the entries along `axis` that are not NaN
    factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig
    # zero out the NaNs, take the mean over the full length,
    # then rescale by the non-NaN fraction
    x[np.isnan(x)] = 0
    return np.mean(x, axis) / factor

Mathematically, the two methods are equivalent. Numerically, they are different.

Your method involves a single division, and it so happens that:

  • the numerator (1. + 2. + 4. + 5.) can be represented exactly as a float; and
  • the denominator (4.) is a power of two.

This means that the result of the division is exact: 3.0. The snippet below confirms it.
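A minimal check with plain Python floats (NumPy isn't needed here, since only the summation and the division are in play):

>>> 1. + 2. + 4. + 5.            # the sum, 12.0, is exactly representable
12.0
>>> (1. + 2. + 4. + 5.) / 4.     # dividing by a power of two is exact
3.0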

stats.nanmean() involves first computing the mean of [1., 2., 0., 4., 5.], and then rescaling it to account for the NaNs. As it happens, this mean (2.4) cannot be represented exactly as a float, so from that point on the computation is inexact; the two steps are reproduced below.
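Here is the same computation redone step by step with plain Python floats, mirroring what the nanmean() code above does:

>>> m = (1. + 2. + 0. + 4. + 5.) / 5   # what np.mean() computes; 2.4 is already inexact
>>> m
2.4
>>> factor = 1.0 - 1.0 / 5             # the non-NaN fraction; 0.8 is inexact too
>>> m / factor
2.9999999999999996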

I haven't given it a lot of thought, but it may be possible to construct an example where the roles would be reversed, and stats.nanmean() would give a more accurate result than the other method.

What surprises me is that stats.nanmean() doesn't simply do something like:

In [6]: np.mean(np.ma.MaskedArray(a, np.isnan(a)))
Out[6]: 3.0

This seems to me a superior approach to what it does currently; a sketch of what that could look like follows.
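For what it's worth, here is a minimal sketch of that alternative with axis support (nanmean_masked is a hypothetical name, not an actual scipy function):

import numpy as np

def nanmean_masked(x, axis=None):
    # Mask the NaNs (masked_invalid also masks infinities) and average
    # the remaining entries directly: the masked mean divides the sum of
    # the unmasked values by their count in a single step, with no
    # rescaling afterwards.
    return np.ma.masked_invalid(x).mean(axis=axis)

Called on the a from the question, nanmean_masked(a) returns 3.0, just like the masked-array one-liner above.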

The answer is in the code of stats.nanmean :

x, axis = _chk_asarray(x, axis)
x = x.copy()
Norig = x.shape[axis]
factor = 1.0 - np.sum(np.isnan(x), axis) * 1.0 / Norig
x[np.isnan(x)] = 0
return np.mean(x, axis) / factor

I believe it has something to do with the 1.0 - np.sum(...) expression, a subtraction involving the sum.
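Indeed, the factor that subtraction produces is itself not exactly representable. One way to see a float's true value is decimal.Decimal:

>>> from decimal import Decimal
>>> Decimal(1.0 - 1.0 / 5)   # the factor when one entry out of five is NaN
Decimal('0.8000000000000000444089209850062616169452667236328125')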

As @eumiro mentioned, stats.nanmean calculates the mean in a more roundabout way than your straightforward one-liner.

From the same reference code,

np.sum(np.isnan(x), axis) returns a numpy.int32. Multiplying it by 1.0 yields a numpy floating-point scalar rather than a plain Python float, and the two can print differently, which shows up as a difference in the result:

>>> numpy.int32(1)*1.0/5
0.20000000000000001
>>> int(numpy.int32(1))*1.0/5
0.2
>>> type(np.sum(np.isnan(x),axis))
<type 'numpy.int32'>
