I have a training set in a Pandas DataFrame, and I pass this DataFrame into model.fit() as df.values. Here is some information about the df:
df.values.shape
# (981, 5)
df.values[0]
# array([163, 0.6, 83, 0.52,
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0])], dtype=object)
As you can see, rows in the df contain 5 columns, 4 of which hold numerical values (either int or float), and one which holds a one-hot encoded array representing some categorical data. I am creating my Keras model as shown below:
model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)
df_labels.values is just a 1D array of 0s and 1s, so I believe I do need a Dense(1) sigmoid layer at the end, as well as 'binary_crossentropy' loss.
This model works well if I only pass numerical data. But as soon as I introduce the one-hot encodings (categorical data), I get this error:
ValueError Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
42 #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
43 #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44 model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
45
46 #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)
~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040
1041 def evaluate(self, x=None, y=None,
~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 outs = to_list(outs)
201 for l, o in zip(out_labels, outs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
2653 array_vals.append(
2654 np.asarray(value,
-> 2655 dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
2656 if self.feed_dict:
2657 for key in sorted(self.feed_dict.keys()):
~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Please do not suggest expanding each value in the one-hot arrays into its own column. This example is a trimmed-down version of my dataset, which contains 6-8 categorical columns, and some of the one-hot arrays have 5000+ elements, so that is not a feasible solution for me. I'm looking to refine my Sequential model (or overhaul the Keras model completely) so it can process categorical data along with numerical data.
Remember, the training labels are a 1D array of 0/1 values. I need both the numerical and the categorical training data predicting one set of outcomes; I can't have one set of predictions from the numerical data and another from the categorical data.
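For what it's worth, the failure can be reproduced outside Keras: NumPy cannot cast a row whose last cell is itself an array into a flat float matrix, which is exactly what Keras attempts when it feeds df.values to the graph. A minimal sketch, using a made-up 249-element one-hot array in place of my real column:

```python
import numpy as np

# A row shaped like the one above: four scalars plus a nested one-hot array.
row = np.array([163, 0.6, 83, 0.52, np.zeros(249)], dtype=object)

# Keras casts the batch to a float array; NumPy cannot place a
# 249-element array into a single float cell, hence the ValueError.
try:
    np.asarray([row], dtype=np.float32)
except ValueError as err:
    print(err)
```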
If flattening the 5000+-element one-hot encoded arrays is a problem, consider an Embedding first layer instead. You can also build a model (defined with the functional API instead of the Sequential API you are using) that takes 2 inputs, one for the numerical data and another for the categorical data. The categorical data can then go through the embedding, and then through a Concatenate layer that joins it with the numerical input. From there on, your model proceeds as it currently does (the 1024-cell layer, and so on).
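A minimal sketch of that two-input layout, assuming 4 numerical columns and a single categorical column with a 5000-token vocabulary passed as an integer index rather than a one-hot array (the sizes and names here are illustrative, not from your dataset):

```python
import numpy as np
from tensorflow import keras

num_numerical, vocab_size, embed_dim = 4, 5000, 32  # assumed sizes

num_in = keras.Input(shape=(num_numerical,), name="numerical")
cat_in = keras.Input(shape=(1,), dtype="int32", name="categorical")

# The Embedding layer replaces the huge one-hot vector with a dense 32-d vector.
cat_emb = keras.layers.Embedding(vocab_size, embed_dim)(cat_in)
cat_emb = keras.layers.Flatten()(cat_emb)

# Join both branches, then proceed as in the original Sequential stack.
x = keras.layers.Concatenate()([num_in, cat_emb])
x = keras.layers.Dense(1024, activation="relu")(x)
x = keras.layers.Dense(512, activation="relu")(x)
out = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=[num_in, cat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training then takes the two arrays separately, with one shared label array:
X_num = np.random.random((981, num_numerical)).astype("float32")
X_cat = np.random.randint(0, vocab_size, size=(981, 1))
y = np.random.randint(0, 2, size=(981,))
model.fit([X_num, X_cat], y, epochs=1, batch_size=32, verbose=0)
```

With more categorical columns you would add one integer Input and one Embedding per column and widen the Concatenate; the single sigmoid output keeps one set of predictions for both kinds of data.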
On the off chance you're still working on this, here is something I reused from the TensorFlow question "Exception encountered when calling layer (type CategoryEncoding)".
You should be fine as long as you don't change the timestep from 1. I also included the output, which yields excellent accuracy... because I reused one of the features as the label. Your question was more about how to get categorical and numerical data fed into a model; you'll take it from there.
import tensorflow as tf
import pandas as pd
import numpy as np
# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories, or None for numerical}
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}
# batch size and timesteps
theBatchSz, theTimeSteps = 10, 1
# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)
# code excerpt
# inventory of the categorical features' values ( None for the numerical)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
               for theCol in theDF.columns}
# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)
# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)
# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}
for theFeature, theCodes in theCatCodes.items():
    if theCodes is None:  # pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput
        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp
    else:  # one-hot encode categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput
        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature],
                                                            name=f"{theFeature}_enc",
                                                            output_mode="one_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)
### Below is what you'd be interested in
theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)
theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutputs = theModel(theBatch)
tf.print(theOutputs[:5], summarize=-1)
x = tf.keras.layers.Dense(1024, activation=tf.nn.relu)(theStackedInputs)
x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(256, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu)(x)
theModelOutputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)
theFullModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theModelOutputs)
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
theFullModel.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
# Yes! I was not super inspired and used one of the categorical columns as the label.
theFullModel.fit(x=theBatch, y=theBatch["Cat2"], epochs=10, verbose=1)
theOutput = theFullModel(theBatch)
tf.print(theOutput, summarize=-1)
# No big whoop, they match. I only showed how to write the way to get the categorical and numerical
# to be fed to the model in one go
tf.print(theBatch["Cat2"], summarize=-1)
{'Cat1': [10 49 43 ... 12 30 16], 'Cat2': [0 1 0 ... 0 0 1], 'Cat3': [0 0 3 ... 2 0 3], 'Num1': [0.61139996409794306 0.5939044614577218 0.59720125040942829 ... 0.60667441398817357 0.29892784677577522 0.18648129276910852], 'Num2': [0.10468088989623936 0.18483904850190647 0.053207335764727581 ... 0.065936326328562944 0.50927258084392213 0.14051443214527148]}
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0.611399949 0.104680888]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0.593904436 0.184839055]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.597201228 0.0532073341]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.668277 0.012965099]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.594983 0.969921947]]
Epoch 1/10
1/1 [==============================] - 1s 542ms/step - loss: 0.6840 - accuracy: 1.0000
Epoch 2/10
1/1 [==============================] - 0s 3ms/step - loss: 0.6190 - accuracy: 1.0000
Epoch 3/10
1/1 [==============================] - 0s 3ms/step - loss: 0.5387 - accuracy: 1.0000
Epoch 4/10
1/1 [==============================] - 0s 3ms/step - loss: 0.4398 - accuracy: 1.0000
Epoch 5/10
1/1 [==============================] - 0s 3ms/step - loss: 0.3275 - accuracy: 1.0000
Epoch 6/10
1/1 [==============================] - 0s 3ms/step - loss: 0.2174 - accuracy: 1.0000
Epoch 7/10
1/1 [==============================] - 0s 3ms/step - loss: 0.1264 - accuracy: 1.0000
Epoch 8/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0643 - accuracy: 1.0000
Epoch 9/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0280 - accuracy: 1.0000
Epoch 10/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0104 - accuracy: 1.0000
[[0.000557512045]
[0.990028]
[0.000891149044]
[0.000809252262]
[0.000103618848]
[0.990051866]
[4.47019956e-05]
[0.000348180532]
[0.000630795956]
[0.990045428]]
[0 1 0 0 0 1 0 0 0 1]
Process finished with exit code 0