
Regression task for neural network with Keras and Tensorflow, shapes and formats of input data

I am studying Keras to build a neural network for regression.

I have obtained a dataset to train a model. Each row represents one case, with the inputs in separate columns followed by the output.

Here is the example dataset:

       x1    x2    x3    x4    x5       y
0     0.00  0.00  0.00  0.00  1.00  76.800
1     0.00  0.00  0.00  0.05  0.95  77.815
2     0.00  0.00  0.00  0.10  0.90  78.830
3     0.00  0.00  0.00  0.15  0.85  79.845
4     0.00  0.00  0.00  0.20  0.80  80.860
...    ...   ...   ...   ...   ...     ...
9108  0.95  0.00  0.00  0.00  0.05  94.945
9109  0.95  0.00  0.00  0.05  0.00  95.960
9110  0.95  0.00  0.05  0.00  0.00  95.550
9111  0.95  0.05  0.00  0.00  0.00  95.250
9112  1.00  0.00  0.00  0.00  0.00  95.900

The NN I am trying to build must take x1..x5 as inputs and predict y, a continuous output variable.

As I understand it, I need to prepare training, validation and cross-validation samples from the initial dataset.
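A minimal sketch of one way to cut a dataframe into training, validation and test subsets. The 70/15/15 fractions, the helper name `split_dataframe` and the synthetic `demo_df` are assumptions made for illustration; they do not come from the question or from the Keras tutorial.

```python
import numpy as np
import pandas as pd


def split_dataframe(df, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle df and cut it into train / validation / test parts.

    Hypothetical helper; the default fractions are illustrative assumptions.
    """
    shuffled = df.sample(frac=1.0, random_state=seed)  # shuffle all rows
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test


# Synthetic stand-in for the real dataset: 5 inputs and one output column.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.random((100, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
demo_df["y"] = rng.random(100)

train_df, val_df, test_df = split_dataframe(demo_df)
print(len(train_df), len(val_df), len(test_df))
```

Keras `fit` can alternatively hold out part of the training data itself through its `validation_split` argument, in which case only a train/test split has to be prepared by hand.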

Here are my beginner questions about preparing data for training and saving it for use with Keras.

  1. How should I prepare the data if I have a simple dataframe (e.g. the dataset I have now)? What is the most recommended way?

What is the most common way and format to prepare data for training? Can a dataframe be used directly, or should it be converted to numpy arrays of a specific shape or to some other format (e.g. so that each row of the array represents one case)?

Do I really have to convert my dataframe to specific shaped numpy arrays?

subset_learn_np = subset_learn_df.to_numpy()

E.g. in this case I got a numpy array with shape (9113, 5): 9113 cases with 5 inputs each.

At the same time I created a numpy array with the result for each case in the previous array: I split the output column off the initial dataframe and converted it to a separate numpy array:

subset_answers_df = df[['y']]
subset_answers_np = subset_answers_df.to_numpy()
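For reference, the two conversions above can be sketched end to end as follows. This is only an illustration built from the first five rows of the example dataset; the names `sample_df`, `feature_columns`, `features_np` and `labels_np` are hypothetical, and casting to `float32` is an assumption rather than a requirement.

```python
import pandas as pd

# The first five rows of the example dataset shown above.
sample_df = pd.DataFrame(
    {
        "x1": [0.00, 0.00, 0.00, 0.00, 0.00],
        "x2": [0.00, 0.00, 0.00, 0.00, 0.00],
        "x3": [0.00, 0.00, 0.00, 0.00, 0.00],
        "x4": [0.00, 0.05, 0.10, 0.15, 0.20],
        "x5": [1.00, 0.95, 0.90, 0.85, 0.80],
        "y": [76.800, 77.815, 78.830, 79.845, 80.860],
    }
)

feature_columns = ["x1", "x2", "x3", "x4", "x5"]

# One row per case, one column per input.
features_np = sample_df[feature_columns].to_numpy(dtype="float32")
# Double brackets keep the labels two-dimensional: one row per case, one column.
labels_np = sample_df[["y"]].to_numpy(dtype="float32")

print(features_np.shape, labels_np.shape)
```

With the full dataset the same two lines would give arrays of shape (9113, 5) and (9113, 1).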
  2. Do I need to normalize this data before training? How do I create a regression model based on the data I have?

  3. In the next stage, I will probably use a dataset with millions of rows, up to several hundred inputs and multiple outputs.

As this data is now stored in separate files, I need to merge it somehow. What do you suggest for this? Are there any special tools for preparing such dataframes and working with such data?
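One possible way to merge such files, sketched with pandas. It assumes the files are CSVs that share the same columns and differ only in their rows, which the question does not state; the file names `part_000.csv` and `part_001.csv` are hypothetical and are created here only so that the sketch runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two small CSV files so the sketch is runnable; in practice these
# would be the existing data files (the names used here are hypothetical).
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"x1": [0.0, 0.05], "x2": [1.0, 0.95], "y": [76.8, 77.8]}).to_csv(
    tmp_dir / "part_000.csv", index=False
)
pd.DataFrame({"x1": [0.10], "x2": [0.90], "y": [78.8]}).to_csv(
    tmp_dir / "part_001.csv", index=False
)

# Read every part and stack the rows into a single dataframe.
csv_paths = sorted(tmp_dir.glob("part_*.csv"))
merged_df = pd.concat([pd.read_csv(path) for path in csv_paths], ignore_index=True)

print(merged_df.shape)
```

This loads everything into memory at once, so it only suits data that fits in RAM; for larger data the files would have to be read in pieces instead.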

While searching the web and docs for an answer, I found a great tutorial here: https://www.tensorflow.org/tutorials/keras/regression

At that stage I did not need to convert my dataset to any specific format; I only took fractions of the dataframe to split it into training and test samples.

The normalizer was adapted on the training features taken from the dataframe.
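The `Normalization` layer used in the tutorial learns a per-feature mean and variance during `adapt` and then shifts and scales its inputs with them. A rough NumPy sketch of that idea, not the layer's exact implementation; the arrays `train_x` and `test_x` and the `eps` value are made up for illustration.

```python
import numpy as np

# Hypothetical training and test features: rows are cases, columns are inputs.
train_x = np.array(
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.25, 0.75]], dtype="float32"
)
test_x = np.array([[0.75, 0.25]], dtype="float32")

# The statistics come from the training data only.
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)
eps = 1e-7  # assumed guard against division by zero for constant columns

train_norm = (train_x - mean) / np.maximum(std, eps)
test_norm = (test_x - mean) / np.maximum(std, eps)  # reuse the training statistics

print(train_norm.mean(axis=0), train_norm.std(axis=0))
```

After this transformation each training column has a mean close to 0 and a standard deviation close to 1, and the test data is scaled with the same constants so that both sets stay comparable.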

And here is the code I have for a simple linear model that works fine for me.

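# imports needed by the script below; df is the dataframe shown above
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# splitting df into train and test
train_dataset = df.sample(frac=0.8, random_state=0)
test_dataset = df.drop(train_dataset.index)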

print('*'*80)
print('train_dataset')
print(train_dataset)
print('*'*80)
print('test_dataset')
print(test_dataset)


# Split features from labels

epochs_num=200

train_features = train_dataset.copy()
test_features = test_dataset.copy()

# pop() removes the 'y' column from the features and returns it as the labels
train_labels = train_features.pop('y')
test_labels = test_features.pop('y')

print('*'*80)
print('training labels')
print(train_labels)

print('training features')
print(train_features)

# Normalization

print(train_dataset.describe().transpose()[['mean', 'std']])

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
print('normalizer.mean.numpy()')
print(normalizer.mean.numpy())

  
def plot_loss(history, name):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error')
    plt.title(name)
    plt.legend()
    plt.grid(True)
    plt.show()

# trying multiple regression

print('*'*80)
print('linear model')

linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

#print(linear_model.predict(train_features[:10]))

print(linear_model.layers[1].kernel)

linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

history = linear_model.fit(
    train_features,
    train_labels,
    epochs=epochs_num,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

plot_loss(history, 'simple linear multivar')

# dictionary collecting the test-set loss of each model
test_results = {}
test_results['linear_model'] = linear_model.evaluate(
    test_features, test_labels, verbose=0)

print(test_results)


error_linear =  linear_model.predict(test_features).flatten() - test_labels

plt.hist(error_linear, bins=50, alpha=0.5, label = 'linear', color='r')

plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid()
plt.legend()
plt.show()

# end of the linear regression with multiple inputs

# saving the model (recent Keras versions expect a file extension such as '.keras' here)
linear_model.save('linear_model')
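
Besides the histogram, the prediction errors stored in `error_linear` could also be summarized as single numbers. A small sketch with made-up values: the `predictions` and `true_values` arrays below are hypothetical stand-ins for `linear_model.predict(test_features).flatten()` and `test_labels`, not real results.

```python
import numpy as np

# Hypothetical predictions and true labels, used only to make the sketch runnable.
predictions = np.array([77.0, 79.0, 80.5, 95.0])
true_values = np.array([76.8, 78.83, 80.86, 95.9])

errors = predictions - true_values

mae = np.mean(np.abs(errors))         # mean absolute error, the metric used as the loss above
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error, weights large misses more

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Since the model is compiled with `loss='mean_absolute_error'`, the MAE computed this way on the test set should agree with the value returned by `linear_model.evaluate`.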
