
Linear regression with TensorFlow

I am trying to understand linear regression... here is the script that I am trying to understand:

'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''

from __future__ import print_function

import tensorflow as tf
import numpy
import matplotlib.pyplot as plt
rng = numpy.random

# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50

# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
                         7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
                         2.827,3.465,1.65,2.904,2.42,2.94,1.3])

n_samples = train_X.shape[0]


# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)


# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})

        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
                "W=", sess.run(W), "b=", sess.run(b))

    print("Optimization Finished!")
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')

    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()

My question is: what does this part represent?

# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

And why are there random float numbers?

Also, could you show me some math formulas that represent the cost, pred, and optimizer variables?

Let's try to put together some intuition and sources along with the tf approach.

General intuition:

Regression as presented here is a supervised learning problem. In it, as defined in Russell & Norvig's Artificial Intelligence, the task is:

given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ..., (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f

For that sake, the h hypothesis function somehow combines each x with the to-be-learned parameters, in order to produce an output that is as close as possible to the corresponding y, and this for the whole dataset. The hope is that the resulting function will be close to f.

But how are these parameters learned? In order to be able to learn, the model has to be able to evaluate itself. Here the cost (also called loss, energy, merit...) function comes into play: it is a metric function that compares the output of h with the corresponding y and penalizes big differences.

Now it should be clear what the "learning" process here actually is: alter the parameters in order to achieve a lower value for the cost function.
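To make that concrete, here is a minimal NumPy sketch of my own (not part of the original example) that evaluates the cost for one arbitrary choice of parameters, using the train_X / train_Y arrays from the question:

import numpy as np

# the question's training data
train_X = np.asarray([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59,
                      2.167, 7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1])
train_Y = np.asarray([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53,
                      1.221, 2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3])
m = train_X.shape[0]

w, b = 0.5, 0.0                                  # arbitrary initial parameters
pred = w * train_X + b                           # hypothesis h(x) = w*x + b
cost = ((pred - train_Y) ** 2).sum() / (2 * m)   # (halved) mean squared error
print("cost for w=0.5, b=0.0:", cost)

"Learning" then just means repeatedly nudging w and b in whatever direction makes that number smaller.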

Linear Regression:

The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as the cost function. Which means:

  • Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders throughout the learning process.

  • Linear: The output of h is merely a linear (actually, affine) combination of the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x, w, b) = w.transposed()*x + b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.

  • Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be found directly with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).

  • Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point (a hand-written sketch of one such step follows this list). If you really need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is only necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
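To make the last bullet concrete, here is a hand-written sketch of my own (not taken from the example) of a single gradient descent step for this one-dimensional model, assuming the cost sum((w*x + b - y)^2) / (2*m) used above:

import numpy as np

def gd_step(w, b, x, y, learning_rate):
    """One gradient descent update for cost = sum((w*x + b - y)^2) / (2*m)."""
    m = x.shape[0]
    residual = w * x + b - y                  # pred - y, for every sample
    grad_w = (residual * x).sum() / m         # d(cost)/dw
    grad_b = residual.sum() / m               # d(cost)/db
    return w - learning_rate * grad_w, b - learning_rate * grad_b

Starting from any w and b and repeating this step many times drives the cost down; tf.train.GradientDescentOptimizer automates exactly this, computing the gradients for you via automatic differentiation.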

The TensorFlow approach:

Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensorFlow follows a declarative style, which means that the graph has to be completely defined and compiled first, before it is deployed and executed. It is very recommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:

  1. Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.

  2. Then you run the computation in a Session, and the library will be able to (re)load the data placeholders and perform the optimization (a tiny sketch of this two-phase pattern follows below).
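A minimal sketch of the define-then-run pattern (assuming the same TF 1.x API used throughout this post):

import tensorflow as tf

a = tf.placeholder(tf.float32)    # phase 1: build the graph; nothing runs yet
c = a * 3.0

with tf.Session() as sess:        # phase 2: execute the graph
    print(sess.run(c, feed_dict={a: 2.0}))   # -> 6.0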

The code:

The code of the example follows this approach closely:

  1. Define the training data X and labels Y, and prepare a placeholder in the Graph for each of them (which is fed in through the feed_dict part).

  2. Define W and b for the parameters. They have to be Variables because they will be updated during the Session.

  3. Define pred (our hypothesis) and cost as explained before.


From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this, but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that).

Cheers! Andres


CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS

These small snippets generate simple multi-dimensional datasets and test both approaches; a short closed-form reference for the normal equations appears after them. Notice that the normal equations approach doesn't require looping, and brings better results. For small dimensionality (DIMENSIONS < 30k) it is probably the preferred approach:

from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf

####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda x: sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise

####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000

# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])

# define the computational graph
graph = tf.Graph()
with graph.as_default():
  # declare graph inputs
  x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
  y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
  x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
  y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
  theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
  theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
  # forward propagation
  train_prediction = tf.matmul(x_train, theta)+theta_0
  test_prediction  = tf.matmul(x_test, theta) +theta_0
  # cost function and optimizer
  train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
  optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
  # test results
  test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)

# run the computation
with tf.Session(graph=graph) as s:
  tf.global_variables_initializer().run()
  print("initialized"); print(theta.eval())
  for step in range(TRAINING_STEPS):
    _, train_c, test_c = s.run([optimizer, train_cost, test_cost],
                               feed_dict={x_train: train_data, y_train: train_labels,
                                          x_test: test_data, y_test: test_labels })
    if (step%100==0):
      # it should return bias close to zero and parameters all close to 1 (see definition of f)
      print("\nAfter", step, "iterations:")
      #print("   Bias =", theta_0.eval(), ", Weights = ", theta.eval())
      print("   train cost =", train_c); print("   test cost =", test_c)
  PARAMETERS_GRADDESC = tf.concat([theta_0, theta], 0).eval()
  print("Solution for parameters:\n", PARAMETERS_GRADDESC)

####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda x: sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor

# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])

# define the computational graph
graph = tf.Graph()
with graph.as_default():
  # declare graph inputs
  x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
  y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
  theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
  # optimum
  optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)

# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
  tf.global_variables_initializer().run()
  print("initialized")
  opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
  PARAMETERS_NORMEQ = opt
  print("Solution for parameters:\n",PARAMETERS_NORMEQ)

####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################

# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda x: PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda x: PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)

Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:

W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b

A variable of type tf.Variable is a parameter which we will learn using TensorFlow. Assume you use gradient descent to minimize the loss function. You need to initialize these parameters first. rng.randn() is used to generate a random value for this purpose.
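For completeness, a small sketch of my own (assuming the TF 1.x API used above) showing that the variables hold no value until they are explicitly initialized inside a session:

import tensorflow as tf

W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b

init = tf.global_variables_initializer()    # the initialization op
with tf.Session() as sess:
    sess.run(init)                          # assigns W = .3 and b = -.3
    print(sess.run(linear_model, feed_dict={x: [1, 2, 3]}))   # ~ [0. 0.3 0.6]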

I think Getting Started With TensorFlow is a good starting point for you.

I'll first define the variables:

W is a weight vector in R^d (same dimensionality as X)
b is a scalar value (the bias)
Y is also a scalar value, i.e. the target value at X

pred = W (dot) X + b   # dot here refers to the dot product

# cost equals the average squared error, summed over the num_samples points
cost = sum((pred - Y)^2) / (2*num_samples)

# finally the optimizer
# it computes the gradient of the cost with respect to each variable and
# applies the update (note the minus sign: we move against the gradient)

W -= learning_rate * (pred - Y)/num_samples * X
b -= learning_rate * (pred - Y)/num_samples

Why are W and b set to random values? Well, they are updated based on gradients computed from the error given by the cost, so W and b could have been initialized to anything. It isn't performing linear regression via the least squares method, although both will converge to the same solution.
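A quick way to check that last claim (my own sketch, not part of the original answer) is to compute the closed-form least-squares fit for the question's data with NumPy and compare it with the W and b printed by the TensorFlow script:

import numpy as np

train_X = np.asarray([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59,
                      2.167, 7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1])
train_Y = np.asarray([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53,
                      1.221, 2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3])

w_ls, b_ls = np.polyfit(train_X, train_Y, 1)   # degree-1 fit = a straight line
print("least-squares fit: W =", w_ls, " b =", b_ls)

With a small enough learning rate and enough epochs, the gradient-descent values should approach these.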

Look here for more information: Getting Started
