
Mixing numerical and categorical data into keras sequential model with Dense layers

I have a training set in a Pandas dataframe, and I pass this dataframe into model.fit() with df.values. Here is some information about the df:

df.values.shape
# (981, 5)

df.values[0]
# array([163, 0.6, 83, 0.52,
#       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0])], dtype=object)

As you can see, each row of the df has 5 columns: 4 contain numerical values (either int or float), and one contains a one-hot encoded array representing some categorical data. I create my keras model as shown below:

model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

model.compile(optimizer=opt, 
      loss='binary_crossentropy',
      metrics=['accuracy'])

model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)

df_labels.values is just a 1D array of 0s and 1s, so I believe I do need a Dense(1) sigmoid layer at the end, along with 'binary_crossentropy' loss.

This model works very well if I only pass numerical data, but as soon as I introduce the one-hot encodings (categorical data), I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
     42     #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
     43     #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44     model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
     45 
     46     #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)

~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1037                                         initial_epoch=initial_epoch,
   1038                                         steps_per_epoch=steps_per_epoch,
-> 1039                                         validation_steps=validation_steps)
   1040 
   1041     def evaluate(self, x=None, y=None,

~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 outs = to_list(outs)
    201                 for l, o in zip(out_labels, outs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
   2653                 array_vals.append(
   2654                     np.asarray(value,
-> 2655                                dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
   2656         if self.feed_dict:
   2657             for key in sorted(self.feed_dict.keys()):

~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: setting an array element with a sequence.
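
As far as I can tell, the failure happens when the backend tries to cast the object-dtype df.values into a float array, because the fifth cell of each row holds a nested array. A minimal sketch (sizes assumed) reproduces the same message outside Keras:

import numpy as np

# A cell holding a nested array makes the row non-rectangular (sizes are assumptions)
row = np.array([163, 0.6, 83, 0.52, np.zeros(249, dtype=int)], dtype=object)
batch = np.stack([row, row])           # shape (2, 5), dtype=object
np.asarray(batch, dtype=np.float32)    # ValueError: setting an array element with a sequence.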

Please do not suggest expanding each value of the one-hot arrays into its own column. This example is a trimmed-down version of my dataset, which contains 6-8 categorical columns, and some of the one-hot arrays have 5000+ elements, so that is not a feasible solution for me. I'm looking to refine my Sequential model (or overhaul the keras model completely) so that it processes categorical data along with numerical data.

Remember, the training labels are a 1D array of 0/1 values. I need the numerical and categorical training data together to predict one set of outcomes; I can't have one set of predictions from the numerical data and another set from the categorical data.

If flattening the 5000+-element one-hot encoded arrays is a problem, consider an Embedding layer as the first layer instead. What you can also do is build a model (defined with the functional API instead of the Sequential API you are using) that takes 2 inputs: one for the numerical data and one for the categorical data. The categorical data can go through the embedding and then through a Concatenate layer together with the numerical input. From there on, your model proceeds as it currently does (1024-unit layer...). A minimal sketch of this idea follows.
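
A rough illustration, not a drop-in for your exact columns: it assumes 4 numerical columns and one categorical feature fed as an integer index with roughly 5000 distinct values.

import tensorflow as tf
from tensorflow import keras

# Two inputs: the 4 numerical columns and the categorical feature as an integer index
num_input = keras.Input(shape=(4,), name="numerical")
cat_input = keras.Input(shape=(1,), name="category", dtype="int32")

# The Embedding layer replaces the 5000+-wide one-hot vector with a dense 32-dim vector
cat_emb = keras.layers.Embedding(input_dim=5000, output_dim=32)(cat_input)
cat_emb = keras.layers.Flatten()(cat_emb)

# Concatenate both branches, then continue with the existing Dense stack
x = keras.layers.Concatenate()([num_input, cat_emb])
x = keras.layers.Dense(1024, activation="relu")(x)
x = keras.layers.Dense(512, activation="relu")(x)
x = keras.layers.Dense(256, activation="relu")(x)
x = keras.layers.Dense(128, activation="relu")(x)
x = keras.layers.Dense(64, activation="relu")(x)
output = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=[num_input, cat_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit then takes one array per input plus the single 1D label array, e.g.
# model.fit([numeric_array, category_index_array], df_labels.values, epochs=10, batch_size=32)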

On the off chance you're still working on this, here is something I reused from the tensorflow question "Exception encountered when calling layer (type CategoryEncoding)":

You should be fine as long as you don't change the timestep from 1. I also included the output, which yields excellent accuracy... because I reused one of the features as the label. Your question was more about how to feed categorical and numerical data into a model; you'll take it from there.

import tensorflow as tf
import pandas as pd
import numpy as np

# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories, or None for numerical}
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}

# batch size and timesteps
theBatchSz, theTimeSteps = 10, 1

# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)

# code excerpt
# inventory of the categorical features' values (None for the numerical ones)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
               for theCol in theDF.columns}

# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)

# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)

# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}

for theFeature, theCodes in theCatCodes.items():
    if theCodes is None: # Pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput

        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp

    else: # Process for categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput

        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature],
                                                            name=f"{theFeature}_enc",
                                                            output_mode="one_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)

### Below is what you'd be interested in

theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)

theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutputs = theModel(theBatch)
tf.print(theOutputs[:5], summarize=-1)

x = tf.keras.layers.Dense(1024, activation=tf.nn.relu)(theStackedInputs)
x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(256, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu)(x)
theModelOutputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)

theFullModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theModelOutputs)
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

theFullModel.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

# Yes! I was not super inspired and used one of the categorical columns as the label.
theFullModel.fit(x=theBatch, y=theBatch["Cat2"], epochs=10, verbose=1)

theOutput = theFullModel(theBatch)
tf.print(theOutput, summarize=-1)
# No big whoop, they match. I only showed how to get the categorical and numerical data
# fed to the model in one go
tf.print(theBatch["Cat2"], summarize=-1)

The output:

{'Cat1': [10 49 43 ... 12 30 16], 'Cat2': [0 1 0 ... 0 0 1], 'Cat3': [0 0 3 ... 2 0 3], 'Num1': [0.61139996409794306 0.5939044614577218 0.59720125040942829 ... 0.60667441398817357 0.29892784677577522 0.18648129276910852], 'Num2': [0.10468088989623936 0.18483904850190647 0.053207335764727581 ... 0.065936326328562944     0.50927258084392213 0.14051443214527148]}
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0.611399949 0.104680888]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0.593904436 0.184839055]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.597201228 0.0532073341]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.668277 0.012965099]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.594983 0.969921947]]
Epoch 1/10
1/1 [==============================] - 1s 542ms/step - loss: 0.6840 - accuracy: 1.0000
Epoch 2/10
1/1 [==============================] - 0s 3ms/step - loss: 0.6190 - accuracy: 1.0000
Epoch 3/10
1/1 [==============================] - 0s 3ms/step - loss: 0.5387 - accuracy: 1.0000
Epoch 4/10
1/1 [==============================] - 0s 3ms/step - loss: 0.4398 - accuracy: 1.0000
Epoch 5/10
1/1 [==============================] - 0s 3ms/step - loss: 0.3275 - accuracy: 1.0000
Epoch 6/10
1/1 [==============================] - 0s 3ms/step - loss: 0.2174 - accuracy: 1.0000
Epoch 7/10
1/1 [==============================] - 0s 3ms/step - loss: 0.1264 - accuracy: 1.0000
Epoch 8/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0643 - accuracy: 1.0000
Epoch 9/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0280 - accuracy: 1.0000
Epoch 10/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0104 - accuracy: 1.0000
[[0.000557512045]
 [0.990028]
 [0.000891149044]
 [0.000809252262]
 [0.000103618848]
 [0.990051866]
 [4.47019956e-05]
 [0.000348180532]
 [0.000630795956]
 [0.990045428]]
[0 1 0 0 0 1 0 0 0 1]

Process finished with exit code 0
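
To adapt this to your own DataFrame, a hypothetical sketch (the column name "cat_col" and the df / df_labels variables are assumptions, and it presumes you rebuild theCatCodes and the input/encoding loop above from your own columns so the Input names match them): keep the categorical columns as integer codes instead of nested one-hot arrays, then feed the frame and your 0/1 labels through a tf.data.Dataset.

# Hypothetical adaptation: categorical column stored as integer codes, not nested one-hot arrays
df_train = df.copy()
df_train["cat_col"] = df_train["cat_col"].astype("category").cat.codes.astype("int64")

# Dict keys become the model's named inputs; labels ride along in the same dataset
train_ds = tf.data.Dataset.from_tensor_slices((dict(df_train), df_labels.values)).batch(32)
theFullModel.fit(train_ds, epochs=10, verbose=1)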
