
How to calculate an OLS regression with survey weights in Python

I want to do a linear regression on survey data with survey weights.

The survey data is from the EU, and each observation has a weight (for example, 0.4 for one respondent and 1.5 for another).

This weight is described as:

"The European Weight, variable 6, produces a representative sample of the European Community as a whole when used in analysis. This variable adjusts the size of each national sample according to each nation's contribution to the population of the European Community."
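As a rough illustration of what such a weight does, here is a minimal sketch of a weighted mean, in which each respondent counts in proportion to their weight. The response values and weights below are made up for illustration and are not taken from the survey.

```python
import numpy as np

# Hypothetical responses and weights, invented for illustration only.
responses = np.array([3.0, 5.0, 4.0, 2.0])
weights = np.array([0.4, 1.5, 1.0, 0.8])

# Every respondent counts equally.
unweighted_mean = responses.mean()

# Each respondent counts in proportion to their weight.
weighted_mean = np.average(responses, weights=weights)

# The same quantity written out by hand: sum(w * x) / sum(w).
manual = (weights * responses).sum() / weights.sum()

print(unweighted_mean, weighted_mean)
```

Respondents with a weight above 1 pull the estimate toward their answer more than respondents with a weight below 1.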

To do my calculation I'm using sklearn.

from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y, sample_weight=weights)

X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.

Am I using 'sample_weight' correctly? Is this the correct way to handle survey weights in scikit-learn?

TL;DR: Yes.

Here is a very simple example of it working:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

regr = linear_model.LinearRegression()

# Three toy observations; the third lies off the line through the first two.
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])  # equal weights

def weighted_lr(X, y, weights):
    """Fit a weighted linear regression, plot the fitted line, and
    return the predictions for X."""

    regr.fit(X, y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred

y_pred = weighted_lr(X, y, weights)
print(y_pred)

# Down-weight the third observation relative to the first two.
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)

[Plot: the three data points and the fitted line, titled "Weights: 1, 1, 1"]

[[  7.14285714]
 [ 24.28571429]
 [ 58.57142857]]

[Plot: the three data points and the fitted line, titled "Weights: 1000, 1000, 1"]

[[  9.96051333]
 [ 20.05923001]
 [ 40.25666338]]

In the first model, with equal weights, the fit is the same as an ordinary (unweighted) linear regression.
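One way to check this without scikit-learn is a small NumPy sketch. The helper `wls_fit` below is a hypothetical name introduced here, and it assumes the usual weighted least-squares formulation (each row rescaled by the square root of its weight). It suggests that only the relative sizes of the weights matter: all-ones weights, all-0.4 weights, and any other constant give the same line.

```python
import numpy as np

def wls_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns [a, b].

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    A = np.column_stack([np.ones_like(x, dtype=float), x])
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w) turns the weighted problem into
    # an ordinary least-squares problem.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef

x = np.array([1.0, 2.0, 4.0])
y = np.array([10.0, 20.0, 60.0])

ols = wls_fit(x, y, np.ones(3))          # equal weights of 1
scaled = wls_fit(x, y, np.full(3, 0.4))  # equal weights of 0.4

# Multiplying all weights by a constant leaves the fit unchanged.
big = wls_fit(x, y, np.array([1000.0, 1000.0, 1.0]))
small = wls_fit(x, y, np.array([1.0, 1.0, 0.001]))

print(ols)  # intercept and slope of the unweighted fit
```

So survey weights such as 0.4 and 1.5 do not need to be rescaled before being passed as sample weights, at least as far as the fitted coefficients are concerned.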

In the second model, the last observation has a very low weight relative to the other two, so the fit almost ignores it: the line passes close to the first two points, and the prediction at x = 4 is about 40 rather than 60.
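The second set of predictions can be cross-checked against the textbook weighted least-squares estimate, beta = (A'WA)^-1 A'Wy, where A is the design matrix with an intercept column and W is the diagonal matrix of weights. This is a minimal NumPy sketch, under the assumption that `sample_weight` in `LinearRegression` corresponds to this estimator; the variable names `A`, `W`, and `beta` are introduced here for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 4.0])
y = np.array([10.0, 20.0, 60.0])
w = np.array([1000.0, 1000.0, 1.0])

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(X), X])
W = np.diag(w)

# Solve the weighted normal equations (A'WA) beta = A'Wy.
beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

y_pred = A @ beta
print(beta)    # intercept and slope
print(y_pred)  # compare with the second printed output above
```

The predictions should agree with the values printed by the scikit-learn model for the weights 1000, 1000, 1 (roughly 9.96, 20.06, and 40.26).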
