
Python: Pearson's r

This is my code for calculating the correlation between two variables using Pearson's r.

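def correlation(x, y):
    # x and y are assumed to be NumPy arrays or pandas Series of equal length
    # Standardize each variable (ddof=0 divides by n, not n - 1)
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)

    # Average the products of the paired standardized values
    return (std_x * std_y).mean()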

I understand that in order to do so, one needs to:

  1. Standardize each variable
  2. Multiply each pair of values, and take the average

Which brings me to my question: why is the line below used

std_x = (x - x.mean()) / x.std(ddof=0)

instead of simply:

r = (x.std(ddof=0)*y.std(ddof=0))/len(x)

I think you are confusing the formula for Pearson's coefficient with something else. Say you have two random variables X and Y. Then Pearson's coefficient is defined as

r = Cov(X, Y)/(s_X*s_Y)

where Cov(X, Y) is the covariance between X and Y, and s_X and s_Y are their standard deviations. The covariance is defined as

Cov(X, Y) = E[(X - E[X])*(Y - E[Y])]

where E[Z] denotes the expected value of the random variable Z.

OK, now that we have the formula, how do we compute it? Strictly speaking you can't, because you don't have access to the true standard deviations and true expected values. Instead, we usually compute the sample correlation coefficient, which uses the same formula but replaces the true values with the values given by estimators.

A natural, unbiased estimator for the expected values in the formula is simply the sample mean (given by np.mean). Likewise, the standard deviation is estimated by the empirical standard deviation given by np.std, which defaults to ddof=0 and therefore matches the x.std(ddof=0) calls in the question.
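As a minimal sketch of these two estimators, the snippet below computes the mean and the empirical standard deviation by hand and compares them with np.mean and np.std. The sample values are made up purely for illustration.

```python
import numpy as np

# Made-up sample, purely for illustration.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Sample mean: the estimator used in place of E[X].
mean_x = x.sum() / n

# Empirical standard deviation with ddof=0 (divide by n).
std_x = np.sqrt(((x - mean_x) ** 2).sum() / n)

print(mean_x, np.mean(x))      # hand-computed vs. NumPy
print(std_x, np.std(x))        # np.std uses ddof=0 by default
print(np.std(x, ddof=1))       # ddof=1 divides by n - 1 instead
```

The ddof=1 variant is shown only for contrast; the code in the question uses ddof=0 throughout.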

So putting it all together, the formula would become

r = np.mean((x-np.mean(x))*(y-np.mean(y)))/(np.std(x)*np.std(y))

which is actually the same as

np.mean(X*Y)

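where X = (x-np.mean(x))/np.std(x) and Y = (y-np.mean(y))/np.std(y) are the standardized variables. np.std(x) and np.std(y) are constants, so dividing by them can be moved inside the mean. That is why the code in the question standardizes each variable first: the product of the two standard deviations divided by len(x) never pairs the values of x with those of y, so it cannot measure how they vary together.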
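To check the equivalence numerically, here is a small sketch that evaluates both forms on made-up data and compares them with np.corrcoef. The random data, the seed and the variable names are assumptions for illustration only. It also evaluates the alternative from the question, which stays the same when y is shuffled because it ignores how the values are paired.

```python
import numpy as np

def correlation(x, y):
    """Pearson's r as the mean product of standardized values (the question's code)."""
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Made-up, linearly related data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Covariance divided by the product of the standard deviations.
r_formula = np.mean((x - np.mean(x)) * (y - np.mean(y))) / (np.std(x) * np.std(y))
# Mean of the product of the standardized variables.
r_standardized = correlation(x, y)
# NumPy's built-in, for reference.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_formula, r_standardized, r_numpy)

# The alternative from the question uses only the two spreads, so shuffling y
# (which destroys the pairing) leaves it unchanged, while r itself changes.
y_shuffled = rng.permutation(y)
alt = x.std(ddof=0) * y.std(ddof=0) / len(x)
alt_shuffled = x.std(ddof=0) * y_shuffled.std(ddof=0) / len(x)
r_shuffled = correlation(x, y_shuffled)
print(alt, alt_shuffled)
print(r_standardized, r_shuffled)
```

The three values of r should agree up to floating-point rounding.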
