
Mixing numerical and categorical data into keras sequential model with Dense layers

I have a training set in a Pandas dataframe, and I pass this dataframe to model.fit() as df.values. Here is some information about the df:

df.values.shape
# (981, 5)

df.values[0]
# array([163, 0.6, 83, 0.52,
#       array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0])], dtype=object)

As you can see, a row in the df contains 5 columns: 4 of them hold numerical values (int or float), and one holds a one-hot encoded array representing some categorical data. I am creating my keras model as follows:

model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

model.compile(optimizer=opt, 
      loss='binary_crossentropy',
      metrics=['accuracy'])

model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)

df_labels.values is just a 1-D array of 0s and 1s, so I believe I really do need a Dense(1) sigmoid layer at the end, together with the 'binary_crossentropy' loss.

This model works fine if I pass only the numerical data. But as soon as I introduce the one-hot encodings (the categorical data), I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
     42     #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
     43     #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44     model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
     45 
     46     #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)

~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1037                                         initial_epoch=initial_epoch,
   1038                                         steps_per_epoch=steps_per_epoch,
-> 1039                                         validation_steps=validation_steps)
   1040 
   1041     def evaluate(self, x=None, y=None,

~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 outs = to_list(outs)
    201                 for l, o in zip(out_labels, outs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
   2653                 array_vals.append(
   2654                     np.asarray(value,
-> 2655                                dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
   2656         if self.feed_dict:
   2657             for key in sorted(self.feed_dict.keys()):

~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: setting an array element with a sequence.

Please don't suggest expanding each value of the one_hot arrays into its own column. This example is a stripped-down version of my dataset, which has 6-8 categorical columns, and some of the one_hots are arrays of size 5000+, so that is not a viable solution for me. I am looking to improve my sequential model (or to overhaul the keras model entirely) so that it handles both the categorical and the numerical data.

Keep in mind that the training labels are a 1-D array of 0/1 values. I need the numerical/categorical training set to predict a single set of results; I can't have one set of predictions coming from the numerical data and another coming from the categorical data.
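For reference, here is a minimal sketch (the length 249 and the values are made up) that reproduces the same ValueError; it suggests the problem is simply that NumPy cannot build a rectangular float array when one object column holds a whole array per row:

import numpy as np

# A single row shaped like mine: 4 scalar values plus a one-hot array in one cell.
row = np.array([163, 0.6, 83, 0.52, np.zeros(249)], dtype=object)

# Keras eventually attempts a cast equivalent to this, which raises
# "ValueError: setting an array element with a sequence."
np.asarray(row, dtype=np.float32)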

If flattening the 5000+ one-hot encoded arrays is a problem, you can use an Embedding first layer instead. Also, what you can do is have a model (defined with the functional API rather than with the sequential API as you did) that takes 2 inputs, one for the numerical input and one for the categorical data. The categorical data can then go through the embedding and then through a concatenation layer together with the numerical input. From there on, your model proceeds just as you currently have it (the 1024-unit layer, ...); a minimal sketch of this two-input layout is shown below.
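Here is that sketch, assuming 4 numeric columns and a single categorical feature with roughly 5000 categories fed as an integer index; all sizes and the df_numeric / cat_idx names are illustrative, not taken from your data:

import tensorflow as tf

# Assumed sizes, for illustration only.
n_numeric = 4        # number of numeric columns
n_categories = 5000  # number of distinct categories in the categorical feature
emb_dim = 64         # embedding width

num_in = tf.keras.layers.Input(shape=(n_numeric,), name="numeric")
cat_in = tf.keras.layers.Input(shape=(1,), dtype="int32", name="category")

# The Embedding replaces the huge one-hot: it expects the integer category index.
emb = tf.keras.layers.Embedding(input_dim=n_categories, output_dim=emb_dim)(cat_in)
emb = tf.keras.layers.Flatten()(emb)

x = tf.keras.layers.Concatenate()([num_in, emb])
x = tf.keras.layers.Dense(1024, activation="relu")(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[num_in, cat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical feeding: df_numeric holds the 4 numeric columns and cat_idx the
# integer category per row (e.g. np.argmax over the stored one-hot arrays).
# model.fit([df_numeric.values.astype("float32"), cat_idx], df_labels.values,
#           epochs=10, batch_size=32)

With 6-8 categorical columns you would add one Input/Embedding pair per column and concatenate them all before the first Dense layer.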

In case you are still working on this, here is something I reused from a tensorflow exception I got when calling the layer (of type CategoryEncoding).

As long as you don't change the timesteps from 1, you should be fine. I also included the output, which shows outstanding accuracy... because I reused one of the features as the label. Your question is more about how to feed categorical and numerical data into the model; you can take it from there.

import tensorflow as tf
import pandas as pd
import numpy as np

# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories, or None for numerical}
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}

# batch size and timesteps
theBatchSz, theTimeSteps = 10, 1

# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)

# code excerpt
# inventory of the categorical features' values ( None for the numerical)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
               for theCol in theDF.columns}

# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)

# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)

# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}

for theFeature, theCodes in theCatCodes.items():
    if theCodes is None: # Pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput

        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp

    else: # Process for categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput

        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature],
                                                            name=f"{theFeature}_enc",
                                                            output_mode="one_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)

### Below is what you'd be interested in

theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)

theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutputs = theModel(theBatch)
tf.print(theOutputs[:5], summarize=-1)

x = tf.keras.layers.Dense(1024, activation=tf.nn.relu)(theStackedInputs)
x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(256, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(128, activation=tf.nn.relu)(x)
x = tf.keras.layers.Dense(64, activation=tf.nn.relu)(x)
theModelOutputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(x)

theFullModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theModelOutputs)
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)

theFullModel.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

# Yes! I was not super inspired and used one of the categorical columns as the label.
theFullModel.fit(x=theBatch, y=theBatch["Cat2"], epochs=10, verbose=1)

theOutput = theFullModel(theBatch)
tf.print(theOutput, summarize=-1)
# No big whoop, they match.  I only showed how to write the way to get the categorical and numerical
# to be fed to the model in one go
tf.print(theBatch["Cat2"], summarize=-1)

Output:

{'Cat1': [10 49 43 ... 12 30 16], 'Cat2': [0 1 0 ... 0 0 1], 'Cat3': [0 0 3 ... 2 0 3], 'Num1': [0.61139996409794306 0.5939044614577218 0.59720125040942829 ... 0.60667441398817357 0.29892784677577522 0.18648129276910852], 'Num2': [0.10468088989623936 0.18483904850190647 0.053207335764727581 ... 0.065936326328562944     0.50927258084392213 0.14051443214527148]}
[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0.611399949 0.104680888]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0.593904436 0.184839055]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.597201228 0.0532073341]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0.668277 0.012965099]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.594983 0.969921947]]
Epoch 1/10
1/1 [==============================] - 1s 542ms/step - loss: 0.6840 - accuracy: 1.0000
Epoch 2/10
1/1 [==============================] - 0s 3ms/step - loss: 0.6190 - accuracy: 1.0000
Epoch 3/10
1/1 [==============================] - 0s 3ms/step - loss: 0.5387 - accuracy: 1.0000
Epoch 4/10
1/1 [==============================] - 0s 3ms/step - loss: 0.4398 - accuracy: 1.0000
Epoch 5/10
1/1 [==============================] - 0s 3ms/step - loss: 0.3275 - accuracy: 1.0000
Epoch 6/10
1/1 [==============================] - 0s 3ms/step - loss: 0.2174 - accuracy: 1.0000
Epoch 7/10
1/1 [==============================] - 0s 3ms/step - loss: 0.1264 - accuracy: 1.0000
Epoch 8/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0643 - accuracy: 1.0000
Epoch 9/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0280 - accuracy: 1.0000
Epoch 10/10
1/1 [==============================] - 0s 3ms/step - loss: 0.0104 - accuracy: 1.0000
[[0.000557512045]
 [0.990028]
 [0.000891149044]
 [0.000809252262]
 [0.000103618848]
 [0.990051866]
 [4.47019956e-05]
 [0.000348180532]
 [0.000630795956]
 [0.990045428]]
[0 1 0 0 0 1 0 0 0 1]

Process finished with exit code 0
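As a side note, here is a sketch (assuming the theDS and theFullModel objects built above) of how the same pipeline could be trained on every batch instead of the single extracted one: map each batch dict to a (features, label) pair and pass the dataset straight to fit().

# Sketch only: pair every batch dict with its "Cat2" column as the label and
# let fit() consume the tf.data pipeline directly.
theTrainDS = theDS.map(lambda theB: (theB, theB["Cat2"]))
theFullModel.fit(theTrainDS, epochs=10, verbose=1)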

