
What algorithm does Pandas use for computing variance?

Which method does Pandas use for computing the variance of a Series?

For example, using Pandas (v0.14.1):

import numpy
import pandas

pandas.Series(numpy.repeat(500111, 2000000)).var()
12.579462289731145

The true variance is obviously 0, so this must be due to some numerical instability. However, in R we get:

var(rep(500111,2000000))
0

I wasn't able to make enough sense of the pandas source code to figure out which algorithm it uses. This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Update: To summarize the comments below: if the bottleneck package (fast NumPy array functions) is installed, a stabler two-pass algorithm, essentially ((arr - arr.mean()) ** 2).mean(), is used and gives 0.0 (as indicated by @Jeff); if it is not installed, the naive single-pass implementation indicated by @BrenBarn is used.
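For reference, here is a minimal sketch of such a two-pass computation in plain NumPy (an illustration of the idea, not bottleneck's actual code):

import numpy as np

arr = np.repeat(500111, 2000000)

# Pass 1: compute the mean. Pass 2: sum the squared deviations from it.
mean = arr.mean()
var = ((arr - mean) ** 2).sum() / (len(arr) - 1)  # ddof=1, like Series.var()
print(var)  # 0.0 -- every deviation from the mean is exactly zero here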

The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:

return np.fabs((XX - X ** 2 / count) / d)

This is the "naive" single-pass implementation at the beginning of the Wikipedia article you mention. (X is the running sum of the values, XX the running sum of their squares, and d is set to N-1 in the default case.)
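Here is a sketch of what that formula does to the question's data, accumulating sequentially the way NumPy summed before pairwise summation arrived in 1.9 (with a modern np.sum and its pairwise summation, the error largely or entirely vanishes for this particular input):

import numpy as np

arr = np.repeat(500111, 2000000)
count = len(arr)
d = count - 1  # ddof=1, the default for Series.var()

# Naive single-pass accumulation of the sum and the sum of squares.
X = 0.0
XX = 0.0
for v in arr:
    X += v       # stays exact: the total, ~1e12, is below 2**53
    XX += v * v  # partial sums pass 2**53, so the additions start rounding

print(np.fabs((XX - X ** 2 / count) / d))  # nonzero rounding noise, like the question's 12.58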

The behavior you're seeing appears to come from the sum of squared values exceeding the 53-bit precision of float64, rather than from an outright overflow: XX and X ** 2 / count are two nearly equal numbers around 5e17, so any rounding error in either one survives the subtraction at full absolute size. The magnitude of the inputs triggers it, but the naive formula is what turns that precision loss into a visible error; a two-pass algorithm avoids the cancellation.
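The loss of significance can be seen with two scalars and no overflow at all (a sketch; the numbers depend only on float64 rounding):

count = 2000000
X = 500111 * count  # exact Python int, and exactly representable in float64

fx = float(X)
# X*X is about 1.0e24 and needs ~80 bits of precision, so float64 must round it:
print(fx * fx == X * X)      # False
print(X * X - int(fx * fx))  # absolute rounding error of roughly 6e7

An absolute error of that size is relatively tiny (about 1e-16 of X**2), but XX and X**2/count agree in their leading sixteen digits, so the subtraction cancels everything except such rounding noise, and the division by d still leaves a visibly wrong variance.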

I don't know the answer for certain, but it seems related to how the Series input is handled, not necessarily to the var computation itself.

import numpy as np
import pandas as pd

np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999

np.var(np.repeat(100000000, 100000))
0.0

Using Pandas 0.11.0.
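A likely explanation is dispatch rather than storage: np.var hands a non-ndarray input to the object's own var method, so the Series goes through pandas' nanvar while the plain array takes NumPy's stable two-pass path. A sketch (reflecting the old versions discussed here; current pandas no longer shows the discrepancy):

import numpy as np
import pandas as pd

s = pd.Series(np.repeat(100000000, 100000))

print(np.var(s))         # delegates to s.var(ddof=0) -> pandas' nanvar
print(s.var(ddof=0))     # same code path, hence the same unstable result
print(np.var(s.values))  # plain ndarray -> NumPy's two-pass method -> 0.0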
