
Pandas agg function gives different results for numpy std vs nanstd

I'm converting some numpy code to use a pandas DataFrame. The data may contain NaN values, so the original code uses numpy's NaN-aware functions such as nanstd. I was under the impression that pandas skips NaN values by default, so I switched to the regular versions of the same functions.

I want to group the data and compute some statistics on it using agg(). However, when I use np.std() I get different results from the original code, even when the data doesn't contain any NaNs.

Here's a small example demonstrating the problem:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([[1.17136, 1.11816],
                    [1.13096, 1.04134],
                    [1.13865, 1.03414],
                    [1.09053, 0.96330],
                    [1.02455, 0.94728],
                    [1.18182, 1.04950],
                    [1.09620, 1.06686]])

>>> df = pd.DataFrame(arr, 
                      index=['foo']*3 + ['bar']*4, 
                      columns=['A', 'B'])

>>> df
           A        B
foo  1.17136  1.11816
foo  1.13096  1.04134
foo  1.13865  1.03414
bar  1.09053  0.96330
bar  1.02455  0.94728
bar  1.18182  1.04950
bar  1.09620  1.06686

>>> g = df.groupby(df.index)

>>> g['A'].agg([np.mean, np.median, np.std])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

>>> g['A'].agg([np.mean, np.median, np.nanstd])
         mean    median    nanstd
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

If I compute the std values with the numpy functions, I get the expected result in both cases. What's going on inside the agg() function?

>>> np.std(df.loc['foo', 'A'])
0.01751583474079002
>>> np.nanstd(df.loc['foo', 'A'])
0.017515834740790021

Edit:

As mentioned in the answer linked by Vivek Harikrishnan, pandas uses a different method to compute the std. This seems to match my results:

>>> g['A'].agg(['mean', 'median', 'std'])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

And if I specify a lambda that calls np.std(), I get the expected result:

>>> g['A'].agg([np.mean, np.median, lambda x: np.std(x)])
         mean    median  <lambda>
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

This suggests that the pandas functions are being called instead when I write g['A'].agg([np.mean, np.median, np.std]). Why does this happen when I explicitly tell it to use the numpy functions?

It seems that pandas either replaces np.std in the .agg([np.mean, np.median, np.std]) call with the built-in pandas Series.std() method, or calls np.std(series, ddof=1):

In [337]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x)])
Out[337]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.055856
foo  1.146990  1.138650  0.021452  0.017516

NOTE: np.std and lambda x: np.std(x) produce different results here.

If we specify ddof=1 (the pandas default) explicitly, we get the same result:

In [338]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x, ddof=1)])
Out[338]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
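The gap between the two columns comes down to the divisor: with ddof=0 (the NumPy default) the sum of squared deviations is divided by n, while with ddof=1 (the pandas default) it is divided by n - 1. A minimal sketch for the 'foo' group, using only the column A values from the question (the variable names are illustrative):

```python
import numpy as np

# Column A of the 'foo' group from the example above.
foo_a = np.array([1.17136, 1.13096, 1.13865])
n = foo_a.size

pop_std = np.std(foo_a)             # NumPy default: ddof=0, divides by n
sample_std = np.std(foo_a, ddof=1)  # pandas default: ddof=1, divides by n - 1

# The two estimates differ only by the factor sqrt(n / (n - 1)).
rescaled = pop_std * np.sqrt(n / (n - 1))

print(f"{pop_std:.6f} {sample_std:.6f} {rescaled:.6f}")
# 0.017516 0.021452 0.021452
```

For small groups like these (n = 3 and n = 4) the factor is large enough to be clearly visible in the output.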

Using the built-in 'std' produces the same result:

In [341]: g['A'].agg([np.mean, np.median, 'std', lambda x: np.std(x, ddof=1)])
Out[341]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
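To reproduce the original NumPy numbers without relying on how agg() interprets np.std, one option is to pass small wrapper functions that state the ddof and the NaN handling explicitly. The sketch below is one way to do that, not the only one; the helper names pop_std and np_nanstd are made up for illustration, and only column A of the question's data is used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1.17136, 1.13096, 1.13865, 1.09053, 1.02455, 1.18182, 1.09620]},
    index=["foo"] * 3 + ["bar"] * 4,
)
g = df.groupby(df.index)


def pop_std(x):
    """Population std via pandas: ddof=0, NaN skipped by default."""
    return x.std(ddof=0)


def np_nanstd(x):
    """Population std via NumPy, ignoring NaN (ddof=0 by default)."""
    return np.nanstd(x.to_numpy())


# 'std' is the pandas built-in (ddof=1); the wrappers are called as written.
result = g["A"].agg(["std", pop_std, np_nanstd])
print(result.round(6))

# Both wrappers ignore missing values (hypothetical data, not from the question).
with_nan = pd.Series([1.0, np.nan, 3.0])
print(pop_std(with_nan), np_nanstd(with_nan))  # 1.0 1.0
```

The 'std' column should match the pandas results shown above, and the two wrapper columns should match the nanstd results from the question.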

The second line of the Zen of Python says it all:

In [340]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.  # <----------- NOTE !!!
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
