简体   繁体   English

使用 TensorFlow 的多元线性回归

[英]Multiple Linear Regression with TensorFlow

I'm trying to perform a Multiple Linear Regression with TensorFlow and confront the results with statsmodels library.我正在尝试使用 TensorFlow 执行多元线性回归,并使用statsmodels库处理结果。

I generated two random variables X1 and X2 (so that anyone can reproduce it) that will explain the Y variable.我生成了两个随机变量X1X2 (以便任何人都可以重现它)来解释 Y 变量。 The X2 variable is completely useless for this regression, it's just noise with a big scale so that the coefficient will result not significant (p-val close to 1). X2变量对于这个回归完全没有用,它只是一个大尺度的噪声,因此系数不会显着(p-val 接近 1)。 At the end I should obtain a model that is basically.最后我应该得到一个基本上是 model 的。 y_data = alpha + (0.25)x1 + (0.00)x2 + error.

I tried to adapt this code to my randomly generated data, but unfortunately, this is not working at all.我试图使这段代码适应我随机生成的数据,但不幸的是,这根本不起作用。 This below is my try:下面是我的尝试:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow import keras
import datetime


#generating variables:
np.random.seed(1)
lin_x = np.arange(0,200,2)
y_data = np.true_divide(lin_x,4)
n = np.shape(lin_x)
##adding noise:
rand1 = norm.rvs(loc=0,scale=5,size=n)
np.random.seed(2)
rand2 = norm.rvs(loc=0,scale=1000,size=n)
x1 = np.add(lin_x,rand1)
x2 = rand2

#creating the X matrix: beta = (X'X)^-1(X'y):
x_data = np.column_stack((x1,x2))
#adding ones vector for the intercept:
x_data = sm.add_constant(x_data)

#MLR with statsmodels:
mod = sm.OLS(y_data,x_data)
LinReg = mod.fit()
print(LinReg.summary())



#MLR with tensorflow:
normalizer = preprocessing.Normalization()
normalizer.adapt(x_data)

normalized_data = normalizer(x_data)

print(normalized_data)


model = tf.keras.Sequential([
  normalizer,
  layers.Dense(units=1)
])

model.compile(loss = tf.losses.MeanSquaredError(),
              optimizer = tf.keras.optimizers.SGD(
                learning_rate=0.06, momentum=0.0, nesterov=True, name="SGD",
              ))

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.summary()
print('--------------')
weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]
print('--------------')

x_data_tf = tf.convert_to_tensor(x_data)
y_data_tf = tf.convert_to_tensor(y_data)
model.fit(y_data_tf,x_data_tf, epochs=1000, callbacks=[tensorboard_callback])

weights = model.layers[0].get_weights()[0]
biases = model.layers[0].get_weights()[1]

print("TensorFlow results: ")
print("weigths: ", weights)
print("biases: ", biases)

print(LinReg.summary())

How can I get the same coefficients that I've obtained with the statsmodels library using TensorFlow?如何获得与使用 TensorFlow 的statsmodels库获得的系数相同的系数?

The key issues with your code are the following:您的代码的关键问题如下:

  1. While it is necessary to add a column of ones to the features matrix x_data before running the regression with statsmodels, this is not necessary when running the regression with tensorflow.虽然有必要在使用 statsmodels 运行回归之前向特征矩阵x_data添加一列,但在使用 tensorflow 运行回归时,这不是必需的。 This means that you are passing 3 features to tensorflow instead of 2, where the additional feature (the first column of x_data ) is constant.这意味着您将 3 个功能传递给 tensorflow 而不是 2 个,其中附加功能( x_data的第一列)是恒定的。
  2. You are normalizing x_data after the first column of ones has already been added with x_data = sm.add_constant(x_data) .x_data = sm.add_constant(x_data)添加第一列之后,您正在规范化x_data As a column of ones has zero variance, after normalization you get a column of nan (as you are dividing by zero).由于一列的方差为零,因此在归一化后,您会得到一列nan (因为您除以零)。 This means that the first of the 3 features that you are passing to tensorflow is completely missing (ie it's always nan ).这意味着您传递给 tensorflow 的 3 个功能中的第一个完全丢失(即始终为nan )。
  3. While statsmodels takes as inputs at first y and then X , tensorflow takes as inputs at first X and then y .虽然 statsmodels 在第一个yX处作为输入,但 tensorflow 在第一个Xy处作为输入。 This means that you have switched the features and target when running the regression in tensorflow.这意味着您在 tensorflow 中运行回归时已经切换了特征和目标。

I included a complete example below.我在下面包含了一个完整的示例。

import numpy as np
import statsmodels.api as sm
import tensorflow as tf

np.random.seed(0)

N = 500    # number of samples
K = 2      # number of features

# generate the features
m = np.array([0, 0])     # means
s = np.array([5, 1000])  # variances
X = np.random.multivariate_normal(m * np.ones(K), s * np.eye(K), N)

# generate the target
b = 0.5                   # bias
w = np.array([0.107, 0])  # weights
y = b + np.dot(X, w) + np.random.normal(0, 1, N)

# normalize the features
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0, ddof=1)

# run a linear regression with statsmodels
reg = sm.OLS(y, sm.add_constant(X)).fit()

# run a linear regression with tensorflow
model = tf.keras.Sequential([
  tf.keras.layers.Dense(units=1)
])

model.compile(loss=tf.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))
model.fit(X, y, epochs=1000, verbose=0)

bias = model.layers[0].get_weights()[1]
weights = model.layers[0].get_weights()[0].flatten()

# compare the parameters
print('Statsmodels parameters:')
print(np.round(reg.params, 3))
# [0.523 0.25 0.063]

print('Tensorflow parameters:')
print(np.round(np.append(bias, weights), 3))
# [0.528 0.25 0.066]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM