简体   繁体   中英

Mixing numerical and categorical data into keras sequential model with Dense layers

I have a training set in a Pandas dataframe, and I pass this data frame into model.fit() with df.values . Here is some information about the df:

df.values.shape
# (981, 5)

df.values[0]
# array([163, 0.6, 83, 0.52,
#       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0])], dtype=object)

As you can see, rows in the df contain 5 columns, 4 of which contain numerical values (either int or float), and one which contains a hot encoded array representing some categorical data. I am creating my keras model as seen below:

model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

model.compile(optimizer=opt, 
      loss='binary_crossentropy',
      metrics=['accuracy'])

model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)

df_labels.values is just a 1D array of 0s and 1s. So I believe I do need a Dense(1) sigmoid layer at the end, as well as 'binary_crossentropy' loss.

This model works excellent if I only pass numerical data. But as soon as I introduce hot encodings (categorical data), I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
     42     #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
     43     #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44     model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
     45 
     46     #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)

~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1037                                         initial_epoch=initial_epoch,
   1038                                         steps_per_epoch=steps_per_epoch,
-> 1039                                         validation_steps=validation_steps)
   1040 
   1041     def evaluate(self, x=None, y=None,

~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 outs = to_list(outs)
    201                 for l, o in zip(out_labels, outs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
   2653                 array_vals.append(
   2654                     np.asarray(value,
-> 2655                                dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
   2656         if self.feed_dict:
   2657             for key in sorted(self.feed_dict.keys()):

~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: setting an array element with a sequence.

Please do not suggest expanding out each value in the one_hot arrays into their own columns. This example is a trimmed down version of my dataset, which contains 6-8 categorical columns, some of the one_hots are arrays of 5000+ size. So this is not a feasible solution for me. I'm looking to perhaps refine my Sequential model (or overhaul the keras model completely) in order to process categorical data along with numerical data.

Remember, the training labels are 1D array of 0/1 values. I need both numerical/categorical training sets predicting one set of outcomes, I can't have one set of predictions from the numerical data and one set of predictions from the categorical data.

If flattening the 5000+ one-hot encoded array is a problem, maybe go with an embedding 1st layer instead. Also, what you can do is have a model (defined with the functional API instead of the sequential API as you do) that takes 2 inputs, one for numerical input and another for the categorical data. The categorical data can then go through the embedding and then through a concatenate layer with the numerical input. From there on, your model proceeds as you currently do (1024-cell layer...).

On the off chance you're still working on this, here is something I reused from tensorflow Exception encountered when calling layer (type CategoryEncoding)

You should be fine as long as you don't change the timestep from 1. I also included the output that yields excellent accuracy... because I reused one of the features for the label. Your question was more about how to get categorical and numerical fed into a model. You'll take it from there.

import tensorflow as tf
import pandas as pd
import numpy as np

# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories or None for numerical
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}

# batch size and timesteps
theBatchSz, theTimeSteps= 10, 1

# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)

# code excerpt
# inventory of the categorical features' values ( None for the numerical)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
               for theCol in theDF.columns}

# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)

# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)

# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}

for theFeature, theCodes in theCatCodes.items():
    if theCodes is None: # Pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput

        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp

    else: # Process for categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput

        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature],
                                                            name=f"{theFeature}_enc",
                                                            output_mode="one_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)

### Below is what you'd be interested in

theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)

theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutputs = theModel(theBatch)
tf.print(theOutputs[:5], summarize=-1)

x = tf.keras.layers.Dense(1024, activation=tf.nn.relu)(theStackedInputs)
x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(256, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu)(x)
theModelOutputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)

theFullModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theModelOutputs)
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

theFullModel.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

# Yes! I was not super inspired and used one of the categorical columns as the label.
theFullModel.fit(x=theBatch, y=theBatch["Cat2"], epochs=10, verbose=1)

theOutput = theFullModel(theBatch)
tf.print(theOutput, summarize=-1)
# No big whoop, they match.  I only showed how to write the way to get the categorical and numerical
# to be fed to the model in one go
tf.print(theBatch["Cat2"], summarize=-1)

The output:

{'Cat1': [10 49 43 ... 12 30 16], 'Cat2': [0 1 0 ... 0 0 1], 'Cat3': [0 0 3 ... 2 0 3], 'Num1': [0.61139996409794306 0.5939044614577218 0.59720125040942829 ... 0.60667441398817357 0.29892784677577522 0.18648129276910852], 'Num2': [0.10468088989623936 0.18483904850190647 0.053207335764727581 ... 0.065936326328562944     0.50927258084392213 0.14051443214527148]}
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0.611399949 0.104680888]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0.593904436 0.184839055]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.597201228 0.0532073341]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.668277 0.012965099]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.594983 0.969921947]]
Epoch 1/10
1/1 [==============================] - 1s 542ms/step - loss: 0.6840 - accuracy: 1.0000
Epoch 2/10
1/1 [==============================] - 0s 3ms/step - loss: 0.6190 - accuracy: 1.0000
Epoch 3/10
1/1 [==============================] - 0s 3ms/step - loss: 0.5387 - accuracy: 1.0000
Epoch 4/10
1/1 [==============================] - 0s 3ms/step - loss: 0.4398 - accuracy: 1.0000
Epoch 5/10
1/1 [==============================] - 0s 3ms/step - loss: 0.3275 - accuracy: 1.0000
Epoch 6/10
1/1 [==============================] - 0s 3ms/step - loss: 0.2174 - accuracy: 1.0000
Epoch 7/10
1/1 [==============================] - 0s 3ms/step - loss: 0.1264 - accuracy: 1.0000
Epoch 8/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0643 - accuracy: 1.0000
Epoch 9/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0280 - accuracy: 1.0000
Epoch 10/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0104 - accuracy: 1.0000
[[0.000557512045]
 [0.990028]
 [0.000891149044]
 [0.000809252262]
 [0.000103618848]
 [0.990051866]
 [4.47019956e-05]
 [0.000348180532]
 [0.000630795956]
 [0.990045428]]
[0 1 0 0 0 1 0 0 0 1]

Process finished with exit code 0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM