
Regression task for neural network with Keras and Tensorflow, shapes and formats of input data

I am studying Keras to build a neural network for regression purposes.

I have obtained a dataset to train a model - each row represents one case, with the inputs in separate columns and the output in the last column.

Here is the example dataset:

       x1    x2    x3    x4    x5       y
0     0.00  0.00  0.00  0.00  1.00  76.800
1     0.00  0.00  0.00  0.05  0.95  77.815
2     0.00  0.00  0.00  0.10  0.90  78.830
3     0.00  0.00  0.00  0.15  0.85  79.845
4     0.00  0.00  0.00  0.20  0.80  80.860
...    ...   ...   ...   ...   ...     ...
9108  0.95  0.00  0.00  0.00  0.05  94.945
9109  0.95  0.00  0.00  0.05  0.00  95.960
9110  0.95  0.00  0.05  0.00  0.00  95.550
9111  0.95  0.05  0.00  0.00  0.00  95.250
9112  1.00  0.00  0.00  0.00  0.00  95.900

The NN I am trying to build must use x1..x5 as inputs and calculate y, the output, which is a continuous variable.

As I understand it, I need to prepare training, validation and cross-validation samples from the initial dataset I have.
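One simple way to get three subsets out of a single dataframe is fractional sampling with pandas (just a sketch; the 80/10/10 fractions are arbitrary, and the code further down instead uses validation_split during fit):

# assuming df is the full dataframe shown above
train_df = df.sample(frac=0.8, random_state=0)      # 80% training
rest_df = df.drop(train_df.index)
val_df = rest_df.sample(frac=0.5, random_state=0)   # half of the remaining 20% for validation
test_df = rest_df.drop(val_df.index)                # the rest for testing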

Here are my newbie questions about preparing data for the learning process and saving it for use with Keras.

  1. How should I prepare the data if I have a simple dataframe (e.g. the dataset I have now) - what is the most recommended way?

What is the most common way and format to prepare data for training? Is it possible to use a dataframe directly, or should it be converted to numpy arrays of a specific shape (or some other format), e.g. where each row of the array represents a case?

Do I really have to convert my dataframe to numpy arrays of a specific shape?

# convert the feature columns of the dataframe to a numpy array
subset_learn_np = subset_learn_df.to_numpy()

E.g. in this case I get a numpy array with shape (9113, 5): 9113 cases with 5 inputs each.
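For what it's worth, tf.keras models generally accept either numpy arrays or pandas DataFrames/Series of numeric columns in fit and predict, so the conversion is optional. A minimal sketch (the random dataframe and the one-layer model below are just placeholders):

import numpy as np
import pandas as pd
import tensorflow as tf

# hypothetical dataframe with the same layout as the dataset above
df_demo = pd.DataFrame(np.random.rand(100, 6), columns=['x1', 'x2', 'x3', 'x4', 'x5', 'y'])
features = df_demo[['x1', 'x2', 'x3', 'x4', 'x5']]   # DataFrame, shape (100, 5)
labels = df_demo['y']                                 # Series, shape (100,)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mean_absolute_error')

# both calls work - Keras converts the DataFrame/Series to arrays internally
model.fit(features, labels, epochs=2, verbose=0)
model.fit(features.to_numpy(), labels.to_numpy(), epochs=2, verbose=0)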

At the same time I created a numpy array with the result (y) for each case, matching the inputs in the previous array - I split the initial dataframe and converted the y column to a separate numpy array:

# keep only the output column and convert it to a separate numpy array
subset_answers_df = df[['y']]
subset_answers_np = subset_answers_df.to_numpy()
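A quick sanity check of the resulting shapes (df[['y']] with double brackets gives a 2-D array; df['y'] would give shape (9113,) - either generally works for a single-output regression):

print(subset_learn_np.shape)    # (9113, 5) - features, one row per case
print(subset_answers_np.shape)  # (9113, 1) - labels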
  2. Do I need to normalize this data before the learning process? How do I create a regression model based on the data I have?

  3. In the next stage I will probably use a dataset with millions of rows, up to several hundred inputs and multiple outputs.

As this data is currently stored in separate files, I need to merge it somehow - what do you suggest for this? Are there any special tools for preparing such dataframes and working with such data?
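One common approach, assuming the files are CSVs with identical columns (the file pattern below is hypothetical), is to read them and concatenate with pandas; for data that no longer fits in memory, chunked reading, tf.data pipelines or a library such as Dask are often suggested. A sketch:

import glob
import pandas as pd

# hypothetical file pattern - adjust to the real file names and format
files = sorted(glob.glob('data/part_*.csv'))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)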

While searching the web and the docs for an answer I found a great tutorial here: https://www.tensorflow.org/tutorials/keras/regression

At that stage I didn't need to transform my dataset into any specific format; I just used fractional sampling to split it into training and test samples.

The normalizer was built (adapted) using the data in the dataframe.

And here is the code I have for a simple linear model, which works fine for me.

# imports needed by the code below; 'df' is the full dataframe loaded earlier (e.g. with pd.read_csv)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers

# splitting df into train and test sets
train_dataset = df.sample(frac=0.8, random_state=0)
test_dataset = df.drop(train_dataset.index)

print('*'*80)
print('train_dataset')
print(train_dataset)
print('*'*80)
print('test_dataset')
print(test_dataset)


# Split features from labels

epochs_num = 200  # number of training epochs

train_features = train_dataset.copy()
test_features = test_dataset.copy()

# pop() removes the 'y' column from the features and returns it as the labels
train_labels = train_features.pop('y')
test_labels = test_features.pop('y')

print('*'*80)
print('training labels')
print(train_labels)

print('training features')
print(train_features)

# Normalization

print(train_dataset.describe().transpose()[['mean', 'std']])

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
print('normalizer.mean.numpy()')
print(normalizer.mean.numpy())

  
def plot_loss(history, name):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Error')
    plt.title(name)
    plt.legend()
    plt.grid(True)
    plt.show()

# trying multiple regression

print('*'*80)
print('linear model')

linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

# run predict once so the model gets built, then inspect the kernel of the Dense layer
print(linear_model.predict(train_features[:10]))

print(linear_model.layers[1].kernel)

linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

history = linear_model.fit(
    train_features,
    train_labels,
    epochs=epochs_num,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)

plot_loss(history, 'simple linear multivar')

test_results = {}  # dict to collect evaluation results
test_results['linear_model'] = linear_model.evaluate(
    test_features, test_labels, verbose=0)

print(test_results)


error_linear =  linear_model.predict(test_features).flatten() - test_labels

plt.hist(error_linear, bins=50, alpha=0.5, label = 'linear', color='r')

plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid()
plt.legend()
plt.show()

# end of regression using a linear model with multiple inputs

# saving the model (a SavedModel directory in TF2; newer Keras versions expect a '.keras' filename, e.g. 'linear_model.keras')
linear_model.save('linear_model')
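For comparison, the linked tutorial also builds a small DNN regressor on top of the same normalizer (two Dense(64, relu) layers). Below is a minimal sketch reusing the variables defined above - an illustration in the spirit of the tutorial, not its code verbatim:

# DNN regression sketch, reusing normalizer, features, labels and plot_loss from above
dnn_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

dnn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # smaller learning rate than for the linear model
    loss='mean_absolute_error')

history_dnn = dnn_model.fit(
    train_features,
    train_labels,
    epochs=epochs_num,
    verbose=0,
    validation_split=0.2)

plot_loss(history_dnn, 'DNN multivar')

test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=0)
print(test_results)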
