
What algorithm does Pandas use for computing variance?

Which method does Pandas use for computing the variance of a Series?

For example, using Pandas (v0.14.1):

import numpy
import pandas

pandas.Series(numpy.repeat(500111, 2000000)).var()
12.579462289731145

The true variance is obviously 0, so this must be due to some numerical instability. However, in R we get:

var(rep(500111,2000000))
0

I wasn't able to make enough sense of the pandas source code to figure out which algorithm it uses. This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Update: To summarize the comments below: if the bottleneck package (fast NumPy array functions) is installed, a stabler two-pass algorithm, essentially ((arr - arr.mean()) ** 2).mean(), is used and gives 0.0 (as indicated by @Jeff); if it is not installed, the naive single-pass implementation indicated by @BrenBarn is used.
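For reference, here is a minimal sketch of such a two-pass computation in plain NumPy (an illustration of the idea, not bottleneck's actual code):

import numpy as np

arr = np.repeat(500111, 2000000)

# Pass 1: compute the mean. Pass 2: sum the squared deviations from it.
mean = arr.mean()
var = ((arr - mean) ** 2).sum() / (len(arr) - 1)  # ddof=1, like Series.var()
print(var)  # 0.0 -- every deviation from the mean is exactly zero here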

The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:

return np.fabs((XX - X ** 2 / count) / d)

This is the "naive" single-pass implementation at the beginning of the Wikipedia article you mention. (X is the running sum of the values, XX the running sum of their squares, and d is set to N-1 in the default case.)
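Here is a sketch of what that formula does to the question's data, accumulating sequentially the way NumPy summed before pairwise summation arrived in 1.9 (with a modern np.sum and its pairwise summation, the error largely or entirely vanishes for this particular input):

import numpy as np

arr = np.repeat(500111, 2000000)
count = len(arr)
d = count - 1  # ddof=1, the default for Series.var()

# Naive single-pass accumulation of the sum and the sum of squares.
X = 0.0
XX = 0.0
for v in arr:
    X += v       # stays exact: the total, ~1e12, is below 2**53
    XX += v * v  # partial sums pass 2**53, so the additions start rounding

print(np.fabs((XX - X ** 2 / count) / d))  # nonzero rounding noise, like the question's 12.58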

The behavior you're seeing appears to come from the sum of squared values exceeding the 53-bit precision of float64, rather than from an outright overflow: XX and X ** 2 / count are two nearly equal numbers around 5e17, so any rounding error in either one survives the subtraction at full absolute size. The magnitude of the inputs triggers it, but the naive formula is what turns that precision loss into a visible error; a two-pass algorithm avoids the cancellation.
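The loss of significance can be seen with two scalars and no overflow at all (a sketch; the numbers depend only on float64 rounding):

count = 2000000
X = 500111 * count  # exact Python int, and exactly representable in float64

fx = float(X)
# X*X is about 1.0e24 and needs ~80 bits of precision, so float64 must round it:
print(fx * fx == X * X)      # False
print(X * X - int(fx * fx))  # absolute rounding error of roughly 6e7

An absolute error of that size is relatively tiny (about 1e-16 of X**2), but XX and X**2/count agree in their leading sixteen digits, so the subtraction cancels everything except such rounding noise, and the division by d still leaves a visibly wrong variance.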

I don't know the answer for certain, but it seems related to how the Series input is handled, not necessarily to the var computation itself.

import numpy as np
import pandas as pd

np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999

np.var(np.repeat(100000000, 100000))
0.0

Using Pandas 0.11.0.
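A likely explanation is dispatch rather than storage: np.var hands a non-ndarray input to the object's own var method, so the Series goes through pandas' nanvar while the plain array takes NumPy's stable two-pass path. A sketch (reflecting the old versions discussed here; current pandas no longer shows the discrepancy):

import numpy as np
import pandas as pd

s = pd.Series(np.repeat(100000000, 100000))

print(np.var(s))         # delegates to s.var(ddof=0) -> pandas' nanvar
print(s.var(ddof=0))     # same code path, hence the same unstable result
print(np.var(s.values))  # plain ndarray -> NumPy's two-pass method -> 0.0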
