Mixing numerical and categorical data into keras sequential model with Dense layers
I have a training set in a Pandas dataframe, and I pass this data frame into model.fit() with df.values. Here is some information about the df:
df.values.shape
# (981, 5)
df.values[0]
# array([163, 0.6, 83, 0.52,
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0])], dtype=object)
As you can see, rows in the df contain 5 columns, 4 of which contain numerical values (either int or float), and one of which contains a one-hot encoded array representing some categorical data. I am creating my keras model as seen below:
model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)
df_labels.values is just a 1D array of 0s and 1s, so I believe I do need a Dense(1) sigmoid layer at the end, as well as 'binary_crossentropy' loss.
This model works well if I only pass numerical data. But as soon as I introduce one-hot encodings (categorical data), I get this error:
ValueError Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
42 #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
43 #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44 model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
45
46 #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)
~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040
1041 def evaluate(self, x=None, y=None,
~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 outs = to_list(outs)
201 for l, o in zip(out_labels, outs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
2653 array_vals.append(
2654 np.asarray(value,
-> 2655 dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
2656 if self.feed_dict:
2657 for key in sorted(self.feed_dict.keys()):
~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Please do not suggest expanding each value in the one-hot arrays into its own column. This example is a trimmed-down version of my dataset, which contains 6-8 categorical columns, and some of the one-hot arrays have 5000+ elements, so this is not a feasible solution for me. I'm looking to perhaps refine my Sequential model (or overhaul the keras model completely) in order to process categorical data along with numerical data.
Remember, the training labels are a 1D array of 0/1 values. I need the numerical and categorical training data together to predict one set of outcomes; I can't have one set of predictions from the numerical data and another set of predictions from the categorical data.
If flattening the 5000+ element one-hot encoded array is a problem, consider using an embedding as the first layer instead. You can also define a model with the functional API, rather than the Sequential API you use now, that takes 2 inputs: one for the numerical data and another for the categorical data. The categorical data can then go through the embedding and then through a concatenate layer together with the numerical input. From there on, your model proceeds as it currently does (1024-unit layer...).
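A minimal sketch of that two-input idea follows. It assumes the categorical column has first been converted from one-hot arrays to integer category indices. The vocabulary size, embedding width, input names, and random stand-in data are all hypothetical values chosen for illustration, not taken from the question.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes, chosen only for illustration
num_numeric = 4        # the question mentions four numerical columns
num_categories = 250   # assumed vocabulary size of the categorical column
embed_dim = 16         # assumed embedding width

# One input per kind of data
numeric_in = tf.keras.layers.Input(shape=(num_numeric,), dtype="float32", name="numeric")
cat_in = tf.keras.layers.Input(shape=(1,), dtype="int32", name="category")

# Integer category index -> dense vector, then drop the length-1 axis
emb = tf.keras.layers.Embedding(input_dim=num_categories, output_dim=embed_dim)(cat_in)
emb = tf.keras.layers.Flatten()(emb)

# Join both branches and continue with the Dense stack from the question
x = tf.keras.layers.Concatenate()([numeric_in, emb])
for units in (1024, 512, 256, 128, 64):
    x = tf.keras.layers.Dense(units, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[numeric_in, cat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random stand-in data with the assumed shapes
rng = np.random.default_rng(0)
n = 64
X_num = rng.random((n, num_numeric)).astype("float32")
X_cat = rng.integers(0, num_categories, size=(n, 1)).astype("int32")
y = rng.integers(0, 2, size=(n,)).astype("float32")

model.fit([X_num, X_cat], y, epochs=1, batch_size=32, verbose=0)
preds = model.predict([X_num, X_cat], verbose=0)
```

With several categorical columns, each one would presumably get its own Input and Embedding before the concatenate step, so a 5000-wide one-hot never has to be materialized.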
On the off chance you're still working on this, here is something I reused from "tensorflow Exception encountered when calling layer (type CategoryEncoding)".
You should be fine as long as you don't change the timestep from 1. I also included the output, which shows excellent accuracy... because I reused one of the features as the label. Your question was more about how to get categorical and numerical data fed into a model. You can take it from there.
import tensorflow as tf
import pandas as pd
import numpy as np
# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories, or None for numerical}
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}
# batch size and timesteps
theBatchSz, theTimeSteps= 10, 1
# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)
# code excerpt
# inventory of the categorical features' values ( None for the numerical)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
               for theCol in theDF.columns}
# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)
# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)
# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}
for theFeature, theCodes in theCatCodes.items():
    if theCodes is None:  # Pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput
        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp
    else:  # Process for categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput
        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature],
                                                            name=f"{theFeature}_enc",
                                                            output_mode="one_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)
### Below is what you'd be interested in
theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)
theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutputs = theModel(theBatch)
tf.print(theOutputs[:5], summarize=-1)
x = tf.keras.layers.Dense(1024, activation=tf.nn.relu)(theStackedInputs)
x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(256, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu)(x)
theModelOutputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)
theFullModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theModelOutputs)
# epsilon and decay are left at their defaults; newer Keras optimizers may reject the decay argument
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=True)
theFullModel.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
# Yes! I was not super inspired and used one of the categorical columns as the label.
theFullModel.fit(x=theBatch, y=theBatch["Cat2"], epochs=10, verbose=1)
theOutput = theFullModel(theBatch)
tf.print(theOutput, summarize=-1)
# No big whoop, they match. I only showed how to write the way to get the categorical and numerical
# to be fed to the model in one go
tf.print(theBatch["Cat2"], summarize=-1)
{'Cat1': [10 49 43 ... 12 30 16], 'Cat2': [0 1 0 ... 0 0 1], 'Cat3': [0 0 3 ... 2 0 3], 'Num1': [0.61139996409794306 0.5939044614577218 0.59720125040942829 ... 0.60667441398817357 0.29892784677577522 0.18648129276910852], 'Num2': [0.10468088989623936 0.18483904850190647 0.053207335764727581 ... 0.065936326328562944 0.50927258084392213 0.14051443214527148]}
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0.611399949 0.104680888]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0.593904436 0.184839055]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.597201228 0.0532073341]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.668277 0.012965099]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.594983 0.969921947]]
Epoch 1/10
1/1 [==============================] - 1s 542ms/step - loss: 0.6840 - accuracy: 1.0000
Epoch 2/10
1/1 [==============================] - 0s 3ms/step - loss: 0.6190 - accuracy: 1.0000
Epoch 3/10
1/1 [==============================] - 0s 3ms/step - loss: 0.5387 - accuracy: 1.0000
Epoch 4/10
1/1 [==============================] - 0s 3ms/step - loss: 0.4398 - accuracy: 1.0000
Epoch 5/10
1/1 [==============================] - 0s 3ms/step - loss: 0.3275 - accuracy: 1.0000
Epoch 6/10
1/1 [==============================] - 0s 3ms/step - loss: 0.2174 - accuracy: 1.0000
Epoch 7/10
1/1 [==============================] - 0s 3ms/step - loss: 0.1264 - accuracy: 1.0000
Epoch 8/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0643 - accuracy: 1.0000
Epoch 9/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0280 - accuracy: 1.0000
Epoch 10/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0104 - accuracy: 1.0000
[[0.000557512045]
[0.990028]
[0.000891149044]
[0.000809252262]
[0.000103618848]
[0.990051866]
[4.47019956e-05]
[0.000348180532]
[0.000630795956]
[0.990045428]]
[0 1 0 0 0 1 0 0 0 1]
Process finished with exit code 0
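To connect this back to the data layout in the question, here is one possible sketch of turning a dataframe whose last column holds one-hot arrays into a plain float matrix plus integer category codes. The column names, values, and the six-category vocabulary are made up for illustration; only the overall shape (four numerical columns and one column of one-hot arrays) follows the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataframe
num_categories = 6
codes = [3, 0, 5]
df = pd.DataFrame({
    "n1": [163, 150, 171],
    "n2": [0.6, 0.4, 0.9],
    "n3": [83, 90, 77],
    "n4": [0.52, 0.31, 0.75],
    "cat": [np.eye(num_categories, dtype=int)[c] for c in codes],
})

# df.values is an object array whose last column holds arrays, so a
# float conversion fails in the same way as the traceback in the question.
try:
    np.asarray(df.values, dtype="float32")
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Numerical columns become an ordinary 2D float array
numeric_cols = ["n1", "n2", "n3", "n4"]
X_num = df[numeric_cols].to_numpy(dtype="float32")

# Each one-hot array collapses to the index of its single 1
one_hot = np.stack(df["cat"].tolist())
X_cat = one_hot.argmax(axis=1).astype("int64")

print(X_num.shape, X_cat)
```

Integer codes like X_cat are the form that an Embedding or CategoryEncoding layer takes as input, so the wide one-hot arrays never need to be spread across dataframe columns.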