
Problems understanding linear regression model tuning in tf.keras

I am working on the Linear Regression with Synthetic Data Colab exercise, which explores linear regression with a toy dataset. A linear regression model is built and trained, and one can play around with the learning rate, the epochs and the batch size. I have trouble understanding how exactly the iterations are done and how they connect to the "epoch" and the "batch size". I am basically not getting how the actual model is trained, how the data is processed and how the iterations are done. To understand this I wanted to follow each step by calculating it manually, so I wanted to have the slope and intercept coefficients for each step. That way I could see what data the "computer" uses, puts into the model, what model results at each specific iteration and how the iterations are done. I first tried to get the slope and intercept for each single step, but failed, because the slope and intercept are only output at the end. My modified code (the original, where I just added:)

  print("Slope")
  print(trained_weight)
  print("Intercept")
  print(trained_bias)

code:

import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt

#@title Define the functions that build and train a model
def build_model(my_learning_rate):
  """Create and compile a simple linear regression model."""
  # Most simple tf.keras models are sequential. 
  # A sequential model contains one or more layers.
  model = tf.keras.models.Sequential()

  # Describe the topography of the model.
  # The topography of a simple linear regression model
  # is a single node in a single layer. 
  model.add(tf.keras.layers.Dense(units=1, 
                                  input_shape=(1,)))

  # Compile the model topography into code that 
  # TensorFlow can efficiently execute. Configure 
  # training to minimize the model's mean squared error. 
  model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=my_learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])
 
  return model           


def train_model(model, feature, label, epochs, batch_size):
  """Train the model by feeding it data."""

  # Feed the feature values and the label values to the 
  # model. The model will train for the specified number 
  # of epochs, gradually learning how the feature values
  # relate to the label values. 
  history = model.fit(x=feature,
                      y=label,
                      batch_size=batch_size,
                      epochs=epochs)

  # Gather the trained model's weight and bias.
  trained_weight = model.get_weights()[0]
  trained_bias = model.get_weights()[1]
  print("Slope")
  print(trained_weight)
  print("Intercept")
  print(trained_bias)
  # The list of epochs is stored separately from the 
  # rest of history.
  epochs = history.epoch

  # Gather the history (a snapshot) of each epoch.
  hist = pd.DataFrame(history.history)

 # print(hist)
  # Specifically gather the model's root mean 
  #squared error at each epoch. 
  rmse = hist["root_mean_squared_error"]

  return trained_weight, trained_bias, epochs, rmse

print("Defined create_model and train_model")

#@title Define the plotting functions
def plot_the_model(trained_weight, trained_bias, feature, label):
  """Plot the trained model against the training feature and label."""

  # Label the axes.
  plt.xlabel("feature")
  plt.ylabel("label")

  # Plot the feature values vs. label values.
  plt.scatter(feature, label)

  # Create a red line representing the model. The red line starts
  # at coordinates (x0, y0) and ends at coordinates (x1, y1).
  x0 = 0
  y0 = trained_bias
  x1 = my_feature[-1]
  y1 = trained_bias + (trained_weight * x1)
  plt.plot([x0, x1], [y0, y1], c='r')

  # Render the scatter plot and the red line.
  plt.show()

def plot_the_loss_curve(epochs, rmse):
  """Plot the loss curve, which shows loss vs. epoch."""

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Root Mean Squared Error")

  plt.plot(epochs, rmse, label="Loss")
  plt.legend()
  plt.ylim([rmse.min()*0.97, rmse.max()])
  plt.show()

print("Defined the plot_the_model and plot_the_loss_curve functions.")

my_feature = ([1.0, 2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0, 10.0, 11.0, 12.0])
my_label   = ([5.0, 8.8,  9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

learning_rate=0.05
epochs=1
my_batch_size=12

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                         my_label, epochs,
                                                         my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

In my specific case my output was:

[screenshot of the training output: loss = 535.48, root_mean_squared_error = 23.1]

Now I tried to replicate this in a simple Excel sheet and calculated the RMSE manually:

[screenshot of the Excel sheet with the manual RMSE calculation]

However, I get an RMSE of 21.8 and not 23.1. Also my loss is not 535.48, but 476.82.

My first question is therefore: where is my mistake, and how is the RMSE calculated?

Second question(s): how can I get the RMSE for each specific iteration? Let's consider that the number of epochs is 4 and the batch size is 4.

[screenshot of the hyperparameter settings with epochs = 4 and batch size = 4]

That gives 4 epochs and 3 batches, each with 4 examples (observations). I don't understand how the model is trained with these iterations. How can I get the coefficients of each regression model and the RMSE, not just for each epoch (so 4 of them), but for each iteration? I think each epoch has 3 iterations, so in total 12 linear regression models should result, and I would like to see these 12 models. What initial values are used at the very first starting point, when no information is given yet, i.e. what slope and intercept does the model start from? I don't specify this. Then I would like to be able to follow how the slope and intercept are adapted at each step. This comes from the gradient descent algorithm, I think, but that would be the super plus. More important for me is to first understand how these iterations are done and how they connect to the epoch and the batch size.

Update: I know that the initial values (for the slope and intercept) are chosen randomly.

Foundation

Problem statement

Let's consider a linear regression model for a set of samples X, where each sample is represented by one feature x. As part of model training, we are searching for the line w.x + b such that the squared loss ((w.x + b) - y)^2 is minimal. For a set of data points we take the mean of the squared loss of each sample, the so-called mean squared error (MSE). The w and b, which stand for weight and bias, are together referred to as the weights.
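
(For concreteness, here is a minimal numpy sketch of that definition, using the feature/label lists from the question; the slope and intercept values are arbitrary placeholders, not anything the model has learned.)

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

w, b = 0.5, 0.0                     # arbitrary example weight and bias
y_hat = w * x + b                   # predictions of the line w.x + b
mse = np.mean((y_hat - y) ** 2)     # mean squared error over all samples
rmse = np.sqrt(mse)                 # root mean squared error, the metric Keras reports
print(mse, rmse)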

Fitting the line/Training the model

  1. There is a closed-form solution for the linear regression problem: (X^T X)^-1 X^T y (see the numpy sketch after this list).
  2. We can also use the gradient descent method to search for the weights that minimize the squared loss. Frameworks like TensorFlow and PyTorch use gradient descent to search for the weights (this is called training).
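
As an illustration of the closed-form solution in point 1, a numpy sketch on the question's toy data could look like the following (only a sketch; it gives the overall best-fit slope and intercept, not what Keras reaches after one epoch of gradient descent):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

# Append a column of ones so the bias b becomes part of the weight vector.
X = np.column_stack([x, np.ones_like(x)])

# Closed form: (X^T X)^-1 X^T y
w, b = np.linalg.inv(X.T @ X) @ X.T @ y
print("slope:", w, "intercept:", b)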

Gradient descent

A gradient descent algorithm for learning the regression looks like the one below:

w, b = some initial value
While model has not converged:
    y_hat = w.X + b
    error = MSE(y, y_hat) 
    back propagate (BPP) error and adjust weights

Each run of the above loop is called an epoch. However, due to resource constraints, the calculation of y_hat, the error and the backpropagation (BPP) is not performed on the full dataset; instead, the data is divided into smaller batches and the above operations are performed on one batch at a time. Also, we normally fix the number of epochs and monitor whether the model has converged.

w, b = some initial value
for i in range(number_of_epochs):
    for X_batch, y_batch in get_next_batch(X, y):
        y_hat = w.X_batch + b
        error = MSE(y_batch, y_hat)
        back propagate (BPP) error and adjust weights

Keras implementation of batches

Let's say we would like to also track the root mean squared error while the model is training. The way Keras implements this is as below:

w, b = some initial value
for i in range(number_of_epochs):
    all_y_hats = []
    all_ys = []
    for X_batch, y_batch in get_next_batch(X, y):
        y_hat = w.X_batch + b
        error = MSE(y_batch, y_hat)

        all_y_hats.extend(y_hat)
        all_ys.extend(y_batch)

        batch_rms_error = RMSE(all_ys, all_y_hats)

        back propagate (BPP) error and adjust weights

As you can see above, the predictions are accumulated and the RMSE is calculated on the accumulated predictions, rather than taking the mean of all the previous batch RMSEs.
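
(A tiny numeric sketch of that difference, with made-up numbers: the RMSE over the accumulated predictions is in general not the average of the per-batch RMSEs.)

import numpy as np

y_true_b1, y_pred_b1 = np.array([1.0, 2.0]), np.array([1.0, 4.0])   # batch 1
y_true_b2, y_pred_b2 = np.array([3.0, 4.0]), np.array([3.0, 3.0])   # batch 2

rmse_b1 = np.sqrt(np.mean((y_pred_b1 - y_true_b1) ** 2))   # per-batch RMSE
rmse_b2 = np.sqrt(np.mean((y_pred_b2 - y_true_b2) ** 2))

y_true_all = np.concatenate([y_true_b1, y_true_b2])        # accumulated samples
y_pred_all = np.concatenate([y_pred_b1, y_pred_b2])
rmse_running = np.sqrt(np.mean((y_pred_all - y_true_all) ** 2))

print((rmse_b1 + rmse_b2) / 2)   # mean of the per-batch RMSEs
print(rmse_running)              # RMSE on the accumulated predictions, as described above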

Implementation in keras

Now that our foundation is clear, let's see how we can implement this tracking in Keras. Keras has callbacks, so we can hook into the on_batch_begin callback and accumulate all_y_hats and all_ys. In the on_batch_end callback Keras gives us the calculated RMSE. We will manually calculate the RMSE using our accumulated all_y_hats and all_ys and verify that it is the same as what Keras calculated. We will also save the weights so that we can later plot the line that is being learned.

import numpy as np
from sklearn.metrics import mean_squared_error
import keras
import matplotlib.pyplot as plt

# Some training data
X = np.arange(16)
y = 0.5*X +0.2

batch_size = 8
all_y_hats = []
learned_weights = [] 

class CustomCallback(keras.callbacks.Callback):
  def on_batch_begin(self, batch, logs={}):    
    w = self.model.layers[0].weights[0].numpy()[0][0]
    b = self.model.layers[0].weights[1].numpy()[0]    
    s = batch*batch_size
    all_y_hats.extend(b + w*X[s:s+batch_size])    
    learned_weights.append([w,b])

  def on_batch_end(self, batch, logs={}):    
    calculated_error = np.sqrt(mean_squared_error(all_y_hats, y[:len(all_y_hats)]))
    print (f"\n Calculated: {calculated_error},  Actual: {logs['root_mean_squared_error']}")
    assert np.isclose(calculated_error, logs['root_mean_squared_error'])

  def on_epoch_end(self, batch, logs={}):
    del all_y_hats[:]    


model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(1,)))
model.compile(optimizer=keras.optimizers.RMSprop(lr=0.01), loss="mean_squared_error",  metrics=[keras.metrics.RootMeanSquaredError()])
# We should set shuffle=False so that we know how batches are divided
history = model.fit(X,y, epochs=100, callbacks=[CustomCallback()], batch_size=batch_size, shuffle=False) 

Output:

Epoch 1/100
 8/16 [==============>...............] - ETA: 0s - loss: 16.5132 - root_mean_squared_error: 4.0636
 Calculated: 4.063645694548688,  Actual: 4.063645839691162

 Calculated: 8.10112834945773,  Actual: 8.101128578186035
16/16 [==============================] - 0s 3ms/step - loss: 65.6283 - root_mean_squared_error: 8.1011
Epoch 2/100
 8/16 [==============>...............] - ETA: 0s - loss: 14.0454 - root_mean_squared_error: 3.7477
 Calculated: 3.7477213352845675,  Actual: 3.7477214336395264
-------------- truncated -----------------------

Ta-da! The assert np.isclose(calculated_error, logs['root_mean_squared_error']) never failed, so our calculation/understanding is correct.

The line

Finally, let's plot the line that is being adjusted by the BPP algorithm based on the mean squared error loss. We can use the code below to create a PNG image of the line being learned at each batch, along with the training data.

for i, (w,b) in enumerate(learned_weights):
  plt.close()
  plt.axis([-1, 18, -1, 10])
  plt.scatter(X, y)
  plt.plot([-1,17], [-1*w+b, 17*w+b], color='green')
  plt.savefig(f'img{i+1}.png')

Below is the gif animation of the above images in the order they are learned.

[GIF animation of the line being learned]

The hyperplane (line in this case) being learned when y = 0.5*X +5.2

[GIF animation of the line being learned when y = 0.5*X + 5.2]

I tried to play with it a little, and I think it is working like this:

  1. Weights (usually random, depending on the settings) are initialized for each feature. The bias, which is initially 0.0, is also initialized.
  2. The loss and metrics for the first batch are computed and printed, and the weights and bias are updated.
  3. Step 2 is repeated for all batches in the epoch; however, after the last batch the loss and metrics are not printed, so what you see on screen are the loss and metrics before the last update in the epoch.
  4. A new epoch is started, and the first metrics and loss you see printed are actually those computed on the last updated weights from the previous epoch...

So basically, I think it can intuitively be said that first the loss is computed and then the weights are updated, which means that the weight update is the last operation in the epoch.

If your model is trained using one epoch and one batch, then what you see on screen is the loss computed on the initial weights and bias. If you want to see the loss and metrics after the end of each epoch (with the most "current" weights), you can pass validation_data=(X,y) to the fit method. That tells the algorithm to compute the loss and metrics once again on the given validation data when the epoch is finished.

Regarding the initial weights of the model, you can try this by manually setting the initial weights of the layer (using the kernel_initializer parameter):

  model.add(tf.keras.layers.Dense(units=1,
                                  input_shape=(1,),
                                  kernel_initializer=tf.constant_initializer(.5)))

Here is the updated part of the train_model function, which shows what I meant:

  def train_model(model, feature, label, epochs, batch_size):
        """Train the model by feeding it data."""

        # Feed the feature values and the label values to the
        # model. The model will train for the specified number
        # of epochs, gradually learning how the feature values
        # relate to the label values.
        init_slope = model.get_weights()[0][0][0]
        init_bias = model.get_weights()[1][0]
        print('init slope is {}'.format(init_slope))
        print('init bias is {}'.format(init_bias))

        history = model.fit(x=feature,
                          y=label,
                          batch_size=batch_size,
                          epochs=epochs,
                          validation_data=(feature,label))

        # Gather the trained model's weight and bias.
        #print(model.get_weights())
        trained_weight = model.get_weights()[0]
        trained_bias = model.get_weights()[1]
        print("Slope")
        print(trained_weight)
        print("Intercept")
        print(trained_bias)
        # The list of epochs is stored separately from the
        # rest of history.
        prediction_manual = [trained_weight[0][0]*i + trained_bias[0] for i in feature]

        manual_loss = np.mean(((np.array(label)-np.array(prediction_manual))**2))
        print('manually computed loss after slope and bias update is {}'.format(manual_loss))
        print('manually computed rmse after slope and bias update is {}'.format(manual_loss**(1/2)))

        prediction_manual_init = [init_slope*i + init_bias for i in feature]
        manual_loss_init = np.mean(((np.array(label)-np.array(prediction_manual_init))**2))
        print('manually computed loss with init slope and bias is {}'.format(manual_loss_init))
        print('manually computed rmse with init slope and bias is {}'.format(manual_loss_init**(1/2)))

output:

"""
init slope is 0.5
init bias is 0.0
1/1 [==============================] - 0s 117ms/step - loss: 402.9850 - root_mean_squared_error: 20.0745 - val_loss: 352.3351 - val_root_mean_squared_error: 18.7706
Slope
[[0.65811384]]
Intercept
[0.15811387]
manually computed loss after slope and bias update is 352.3350379264957
manually computed rmse after slope and bias update is 18.77058970641295
manually computed loss with init slope and bias is 402.98499999999996
manually computed rmse with init slope and bias is 20.074486294797182
"""

Note that the manually computed loss and metrics after the slope and bias update match the validation loss and metrics, and the manually computed loss and metrics before the update match the loss and metrics of the initial slope and bias.


Regarding the second question, I think that you could split your data into batches manually and then iterate over each batch and fit on it. Then, in each iteration, the model prints the loss and metrics for the validation data. Something like this:

  init_slope = model.get_weights()[0][0][0]
  init_bias = model.get_weights()[1][0]
  print('init slope is {}'.format(init_slope))
  print('init bias is {}'.format(init_bias))
  batch_size = 3

  for idx in range(0,len(feature),batch_size):
      model.fit(x=feature[idx:idx+batch_size],
                y=label[idx:idx+batch_size],
                batch_size=1000,
                epochs=epochs,
                validation_data=(feature,label))
      print('slope: {}'.format(model.get_weights()[0][0][0]))
      print('intercept: {}'.format(model.get_weights()[1][0]))
      print('x data used: {}'.format(feature[idx:idx+batch_size]))
      print('y data used: {}'.format(label[idx:idx+batch_size]))

output:

init slope is 0.5
init bias is 0.0
1/1 [==============================] - 0s 117ms/step - loss: 48.9000 - root_mean_squared_error: 6.9929 - val_loss: 352.3351 - val_root_mean_squared_error: 18.7706
slope: 0.6581138372421265
intercept: 0.15811386704444885
x data used: [1.0, 2.0, 3.0]
y data used: [5.0, 8.8, 9.6]
1/1 [==============================] - 0s 21ms/step - loss: 200.9296 - root_mean_squared_error: 14.1750 - val_loss: 306.3082 - val_root_mean_squared_error: 17.5017
slope: 0.8132714033126831
intercept: 0.3018075227737427
x data used: [4.0, 5.0, 6.0]
y data used: [14.2, 18.8, 19.5]
1/1 [==============================] - 0s 22ms/step - loss: 363.2630 - root_mean_squared_error: 19.0595 - val_loss: 266.7119 - val_root_mean_squared_error: 16.3313
slope: 0.9573485255241394
intercept: 0.42669767141342163
x data used: [7.0, 8.0, 9.0]
y data used: [21.4, 26.8, 28.9]
1/1 [==============================] - 0s 22ms/step - loss: 565.5593 - root_mean_squared_error: 23.7815 - val_loss: 232.1553 - val_root_mean_squared_error: 15.2366
slope: 1.0924618244171143
intercept: 0.5409283638000488
x data used: [10.0, 11.0, 12.0]
y data used: [32.0, 33.8, 38.2]

Linear Regression Model

A linear regression model has only one neuron with a linear activation function. The basic idea of training the model is that we use Gradient Descent. Each time the entire data is passed through the model and the weights are updated, it is called 1 epoch. Note that with full-batch training, an iteration and an epoch are the same thing.

Basic Training Steps :

Prepare data
Initialize the model and its parameters (weights and biases)
for each epoch:  #(both iteration and epoch same here)
    Forward Propagation
    Compute Cost
    Back Propagation
    Update Parameters
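
Those steps can be written out directly in numpy for the single-feature case. The following is only a sketch of plain full-batch gradient descent on the question's data; the initial values, learning rate and epoch count are arbitrary, and the RMSprop optimizer used in the Colab scales its updates differently:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

w, b = 0.5, 0.0    # initial parameters (arbitrary here)
lr = 0.01          # learning rate (arbitrary here)

for epoch in range(10):                     # full batch: one update per epoch
    y_hat = w * x + b                       # forward propagation
    cost = np.mean((y_hat - y) ** 2)        # compute cost (MSE)
    dw = 2 * np.mean((y_hat - y) * x)       # back propagation: d(cost)/dw
    db = 2 * np.mean(y_hat - y)             # d(cost)/db
    w -= lr * dw                            # update parameters
    b -= lr * db
    print(epoch, cost, w, b)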

Gradient Descent has three variants:

  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini-Batch Gradient Descent (MBGD)

Batch Gradient Descent is what we talked about earlier (passing the entire data at once). It is also generally known simply as Gradient Descent.

In Stochastic Gradient Descent we pass 1 random example at a time, and the weights are updated with every example passed. Now the iteration comes into play. When the model has been trained on 1 example, 1 iteration is completed. However, there are more examples in the data set that the model has not seen yet. Training on all of those examples is called 1 epoch. Since 1 example is passed at a time, SGD is very slow for larger data sets as it loses the benefit of vectorization.

So we generally use Mini-Batch Gradient Descent. Here the data set is divided into a number of chunks of fixed size. The size of each chunk is called the batch size, and it can be anywhere between 1 and the data size. In each epoch these batches of data are used to train the model.

1 iteration processes 1 batch of data. 1 epoch processes all batches of the data. So 1 epoch contains 1 or more iterations.

Thus, if the size of the data is m, the amount of data fed during each iteration is (a short sketch with the question's numbers follows this list):

  • BGD = m
  • SGD = 1
  • MBGD = somewhere between 1 and m
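
Here is the short sketch announced above, plugging in the numbers from the question (12 examples, batch size 4, 4 epochs):

import math

m = 12                                                 # number of examples
batch_size = 4
epochs = 4

iterations_per_epoch = math.ceil(m / batch_size)       # 3 batches per epoch
total_weight_updates = epochs * iterations_per_epoch   # 12 updates, one per iteration
print(iterations_per_epoch, total_weight_updates)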

Basic Training Steps for MBGD:

Prepare data
Initialize the model and its parameters (weights and biases)
for each epoch:  #(epoch)
    for each mini_batch: #(iteration)
        Forward Propagation
        Compute Cost
        Back Propagation
        Update Parameters

This is the theoretical concept behind gradient descent, batches, epochs and iterations.

Now moving on to Keras and your code:

I ran your Colab code and it is working perfectly fine. In the code you have posted, the number of epochs is 1, which is far too small for the model to learn, since there is very little data and the model itself is very simple. So you need to either increase the data volume, create a more complex model, or train for a larger number of epochs, around 400-500 as far as I found from the notebook. With a properly adjusted learning rate, the number of epochs can be decreased, for example:

learning_rate=0.14
epochs=70
my_batch_size= 32 

my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature, 
                                                        my_label, epochs,
                                                        my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)

If the learning rate is very small, the model will learn slowly, so it requires more training cycles (epochs) to make accurate predictions. Increasing the learning rate speeds up the learning process, so the number of epochs can be decreased. Please compare the different sections of the code in the Colab for proper examples.

Regarding getting metrics for each iteration:

Keras is a high-level API of TensorFlow. As far as I know (not considering customization of the API), during training Keras calculates the loss, errors and accuracy for the training set at the end of each iteration, and at the end of each epoch it returns their respective averages. So if there are n epochs, then there will be n values of each of these metrics, no matter how many iterations come in between.
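
If you do want the metrics for each individual iteration, a custom callback is one way to record them. The sketch below is my own illustration (the BatchLogger name and the chosen fields are not part of the original Colab); it assumes the model was compiled with the RootMeanSquaredError metric, as in the question:

import tensorflow as tf

class BatchLogger(tf.keras.callbacks.Callback):
    """Record the loss, RMSE and current weights after every batch (iteration)."""
    def __init__(self):
        super().__init__()
        self.batch_logs = []

    def on_train_batch_end(self, batch, logs=None):
        weight, bias = self.model.get_weights()   # current slope and intercept
        self.batch_logs.append({"batch": batch,
                                "loss": logs["loss"],
                                "rmse": logs["root_mean_squared_error"],
                                "slope": float(weight[0][0]),
                                "intercept": float(bias[0])})

# Usage sketch:
# logger = BatchLogger()
# my_model.fit(my_feature, my_label, batch_size=4, epochs=4, callbacks=[logger])
# logger.batch_logs then holds one entry per iteration.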

Regarding the slope and the intercept:

A linear regression model uses a linear activation function at the output layer, which is y = mx + c. For the values we have:

  • y - refers to the output
  • x - refers to the inputs
  • m - refers to the slope (that has to be adjusted)
  • c - refers to the intercept (that can also be adjusted)

In our model, these m and c are what we adjust. They are the weight and bias of our model. So our function looks like y = Wx + b, where b gives the intercept and W gives the slope. The weight is initialized randomly at the beginning; in Keras the bias defaults to zero, although it can also be initialized differently.
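
To see what those initial values actually are before any training happens, you can inspect a freshly built model directly. A small sketch, reusing the build_model function from the question (by default the Dense kernel is a random glorot_uniform draw and the bias is zero):

my_model = build_model(my_learning_rate=0.05)

initial_kernel, initial_bias = my_model.get_weights()
print("initial slope:", initial_kernel[0][0])   # random by default
print("initial intercept:", initial_bias[0])    # zero by default for a Dense layer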

Colab link for Linear Regression Model from Scratch

Please tweak the values as needed. Since the model is implemented from scratch, you can collect or print any value you want to track during training. You may also use your own data set, but make sure it is valid or generated by some library such as sklearn, so that you can validate the model.

https://colab.research.google.com/drive/1RfuRNMoVv-l6KyM_SegdJOHiXD_0xBHq?usp=sharing

PS: If you find anything confusing, please comment. I would be happy to reply.
