Pandas agg function gives different results for numpy std vs nanstd

I'm converting some numpy code to use a pandas DataFrame. The data potentially contains NaN values, so the original code uses numpy's nan functions, such as nanstd. I was under the impression that pandas skips NaN values by default, so I switched to the regular versions of the same functions.

I want to group the data and compute some statistics on it using agg(). However, when I use np.std() I get different results from the original code, even when the data doesn't contain any NaNs.

Here's a small example demonstrating the problem:

>>> import numpy as np
>>> import pandas as pd

>>> arr = np.array([[1.17136, 1.11816],
                    [1.13096, 1.04134],
                    [1.13865, 1.03414],
                    [1.09053, 0.96330],
                    [1.02455, 0.94728],
                    [1.18182, 1.04950],
                    [1.09620, 1.06686]])

>>> df = pd.DataFrame(arr, 
                      index=['foo']*3 + ['bar']*4, 
                      columns=['A', 'B'])

>>> df
           A        B
foo  1.17136  1.11816
foo  1.13096  1.04134
foo  1.13865  1.03414
bar  1.09053  0.96330
bar  1.02455  0.94728
bar  1.18182  1.04950
bar  1.09620  1.06686

>>> g = df.groupby(df.index)

>>> g['A'].agg([np.mean, np.median, np.std])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

>>> g['A'].agg([np.mean, np.median, np.nanstd])
         mean    median    nanstd
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

If I compute the std values with the numpy functions directly, I get the expected result in both cases. What's going on inside the agg() function?

>>> np.std(df.loc['foo', 'A'])
0.01751583474079002
>>> np.nanstd(df.loc['foo', 'A'])
0.017515834740790021

Edit:

As mentioned in the answer linked by Vivek Harikrishnan, pandas uses a different method to compute the std. This seems to match my results:

>>> g['A'].agg(['mean', 'median', 'std'])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

And if I specify a lambda that calls np.std(), I get the expected result:

>>> g['A'].agg([np.mean, np.median, lambda x: np.std(x)])
         mean    median  <lambda>
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

This suggests that the pandas functions are called instead when I write g['A'].agg([np.mean, np.median, np.std]). The question is: why does this happen when I explicitly tell it to use the numpy functions?

It seems that pandas either replaces np.std in the .agg([np.mean, np.median, np.std]) call with the built-in Series.std() method or calls np.std(series, ddof=1):

In [337]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x)])
Out[337]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.055856
foo  1.146990  1.138650  0.021452  0.017516

NOTE: np.std and lambda x: np.std(x) produce different results.
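To check how a particular pandas installation treats a bare np.std inside agg(), one option is to compare its output with reference values computed on plain arrays. This is only a sketch: the variable names (direct, pop, sample, verdict) are made up for illustration, and the warning filter is there on the assumption that some pandas versions warn when a NumPy function is passed directly. The outcome may depend on the pandas version, so the script reports what it finds rather than assuming one answer.

```python
import warnings

import numpy as np
import pandas as pd

# Column A from the question, as a Series with the same group labels.
s = pd.Series(
    [1.17136, 1.13096, 1.13865, 1.09053, 1.02455, 1.18182, 1.09620],
    index=["foo"] * 3 + ["bar"] * 4,
    name="A",
)
g = s.groupby(level=0)

with warnings.catch_warnings():
    # Assumption: some pandas versions may warn about NumPy callables here.
    warnings.simplefilter("ignore")
    direct = g.agg(np.std)

# Reference values computed on plain arrays, so no pandas dispatch is involved.
pop = g.apply(lambda x: np.std(x.to_numpy()))             # ddof=0
sample = g.apply(lambda x: np.std(x.to_numpy(), ddof=1))  # ddof=1

if np.allclose(direct, sample):
    verdict = "np.std was treated as the pandas std (ddof=1)"
elif np.allclose(direct, pop):
    verdict = "np.std was called as written (ddof=0)"
else:
    verdict = "unexpected result"
print(verdict)
```

On the setup shown in this thread the first branch matches the outputs above; on other versions the second branch may apply.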

If we specify ddof=1 (the pandas default) explicitly, we get the same result:

In [338]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x, ddof=1)])
Out[338]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
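The two numbers differ only in the divisor used for the variance: ddof=0 divides the sum of squared deviations by n, while ddof=1 divides by n - 1. A minimal sketch that reproduces both values for the foo group by hand (the variable names are illustrative only):

```python
import numpy as np

# Column A values of the 'foo' group from the question.
foo = np.array([1.17136, 1.13096, 1.13865])
n = foo.size

# Sum of squared deviations from the mean, shared by both estimators.
sq_dev_sum = ((foo - foo.mean()) ** 2).sum()

pop_std = np.sqrt(sq_dev_sum / n)           # ddof=0, the np.std default
sample_std = np.sqrt(sq_dev_sum / (n - 1))  # ddof=1, the pandas default

print(round(pop_std, 6), round(sample_std, 6))  # 0.017516 0.021452

# The two are related by a constant factor that depends only on n.
print(np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1))))  # True
```

With only three or four values per group, the factor sqrt(n / (n - 1)) is large enough to be visible in the second decimal place of the result.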

Using the built-in 'std' produces the same result:

In [341]: g['A'].agg([np.mean, np.median, 'std', lambda x: np.std(x, ddof=1)])
Out[341]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
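For the original goal (porting np.nanstd code to pandas), one way to avoid the ambiguity is to spell out the divisor in the aggregation itself. The sketch below uses named aggregation with the pandas std method and an explicit ddof; the helper std_table and the column names sample_std and pop_std are hypothetical, chosen only for this example. It also injects a NaN to check that the ddof=0 variant agrees with np.nanstd.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1.17136, 1.13096, 1.13865, 1.09053, 1.02455, 1.18182, 1.09620]},
    index=["foo"] * 3 + ["bar"] * 4,
)

def std_table(frame):
    """Per-group std of column A with both divisors spelled out."""
    grouped = frame.groupby(frame.index)["A"]
    return grouped.agg(
        sample_std="std",                 # pandas method, default ddof=1
        pop_std=lambda x: x.std(ddof=0),  # pandas method, ddof=0 like np.std
    )

clean = std_table(df)
print(clean.round(6))

# Inject a NaN into the first 'foo' row: pandas skips it, as np.nanstd does.
with_nan = df.copy()
with_nan.iloc[0, 0] = np.nan
dirty = std_table(with_nan)
expected = np.nanstd(with_nan.loc["foo", "A"].to_numpy())
print(np.isclose(dirty.loc["foo", "pop_std"], expected))
```

If only the ddof=0 value is needed, g['A'].std(ddof=0) gives it without going through agg() at all.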

The second line of the Zen of Python says it all:

In [340]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.  # <----------- NOTE !!!
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
