How can I use the learned embedding layer from a Keras ANN as input features in an XGBoost model?

Question

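I am trying to reduce the dimensionality of a categorical feature by extracting the embedding layer from a neural network and using it as input features in a separate XGBoost model.

The embedding layer has dimensions (nr. of unique categories + 1, chosen output size). How can it be concatenated with the continuous variables in the original training data, which have dimensions (nr. of observations, nr. of features)?

Below is a reproducible example of regression with a neural network, in which the categorical feature is encoded as a learned embedding layer. The example is closely adapted from: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers

At the end, I print the embedding layer and its shape. How can this layer be merged with the continuous features in the original training data (X_train_continuous)? If the number of rows equaled the number of categories, and if we knew the order in which the categories are represented in the embedding layer, the embedding array could be joined to the training observations by category. However, the number of rows equals the number of categories + 1 (in the code: len(values) + 1).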

# Imports and helper functions

import numpy as np
import pandas as pd
import keras
from keras.models import Model
from keras.layers import Input, Embedding, Dense
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once in 'display' epochs
    """

    def __init__(self, display=100):
        super().__init__()
        self.display = display

    def on_train_begin(self, logs=None):
        self.epochs = 0

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
                self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example

per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a,s,n).

    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * \
           per_room_additional_price[area]

# Create toy data

AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames

    Note that the np.random.choice call only determines the number of rooms and the size of the house
    (the price, which we calculate later, is deterministic)

    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []

    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms) for area in
                    AREAS]

    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])

# Create the train and validation sets

train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set

train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors

continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization

# Normalizing both train and test sets to have 0 mean and std. of 1 using the
# train set mean and std.
# This will give each feature an equal initial importance and speed up the
# training time

train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std

X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std

# Build a model using a categorical variable
# First let's define a helper class for the categorical variable

class EmbeddingMapping():
    """
    Helper class for handling categorical variables

    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer value
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in
                               enumerate(values)}

        # The num_values will be used as the input_dim when defining the
        # embedding layer: indices 1..len(values) are the seen categories
        # and index 0 is reserved for unseen values
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]

        # Else, return the reserved index 0 for unseen values
        # (returning num_values would be out of range for the embedding layer,
        # whose valid indices are 0..num_values - 1)
        else:
            return 0

# Create an embedding column for the train/validation sets

area_mapping = EmbeddingMapping(X_train_categorical['area'])

X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers

# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide to what vector size we want to map our 'area' variable.
# I'll use 2 here because we only have three areas
embeddings_output = 2

# Let’s define the embedding layer and flatten it
area_embedings = Embedding(output_dim=embeddings_output,
                           input_dim=area_mapping.num_values,
                           input_length=1, name="embedding_layer")(area_input)
area_embedings = keras.layers.Reshape((embeddings_output,))(area_embedings)

# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1], ))

# Concatenate continuous and embeddings inputs
all_input = keras.layers.concatenate([continuous_input, area_embedings])

# To merge them together we will use Keras Functional API
# Will define a simple model with 2 hidden layers, with 25 neurons each.

# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note: the model takes the input object 'area_input', not the embedding
# output 'area_embedings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Let's train the model

epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note continuous and categorical columns are passed in the same order as
# the inputs given to Model(inputs=[continuous_input, area_input])
history = model.fit(
    [X_train_continuous, X_train_categorical['area_mapping']],
    y_train,
    epochs=epochs,
    batch_size=128,
    callbacks=[periodic_logger_250],
    verbose=0,
    validation_data=(
        [X_val_continuous, X_val_categorical['area_mapping']], y_val))

# Observe the embedding layer

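# Use a new name so the integer `embeddings_output` (the embedding size)
# defined above is not overwritten by the weight matrix
embedding_weights = model.get_layer('embedding_layer').get_weights()[0]

print(f'Embedding layer:\n{embedding_weights}')
print(f'Embedding layer shape: {embedding_weights.shape}')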

Answer 1

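First, this post has a terminology problem: an "embedding" is the representation of a specific input sample. It is the vector output by a layer. The "weights" are the matrices stored and trained inside the layer.

In Keras, the Model class is a subclass of Layer. You can use any Model as a layer in a larger model.

You can create a Model containing only the embedding layer, then use it as a layer when building the rest of the model. After training, you can call .predict() on that "submodel". You can also save that submodel to a JSON file and reload it later.

This is the standard technique for creating a model that emits an internal embedding.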
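The submodel approach described above might look like the following minimal sketch. The toy data, the sizes, and the names `embedding_model`, `full_model`, `X_cont` and `X_cat` are assumptions made for illustration and do not come from the question's code.

```python
import numpy as np
import keras
from keras import layers

n_categories = 3   # assumed number of distinct categories
embedding_dim = 2  # assumed embedding size

# Submodel that contains only the embedding (plus a reshape to 2-D output)
cat_in = keras.Input(shape=(1,), dtype="int32")
emb = layers.Embedding(input_dim=n_categories + 1,
                       output_dim=embedding_dim,
                       name="embedding_layer")(cat_in)
emb = layers.Reshape((embedding_dim,))(emb)
embedding_model = keras.Model(cat_in, emb)

# Larger model that uses the submodel as if it were a layer
continuous_input = keras.Input(shape=(2,))
area_input = keras.Input(shape=(1,), dtype="int32")
x = layers.concatenate([continuous_input, embedding_model(area_input)])
x = layers.Dense(25, activation="relu")(x)
predictions = layers.Dense(1)(x)
full_model = keras.Model([continuous_input, area_input], predictions)
full_model.compile(loss="mse", optimizer="adam")

# Toy data, for illustration only
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(30, 2)).astype("float32")
X_cat = rng.integers(1, n_categories + 1, size=(30, 1)).astype("int32")
y = rng.normal(size=(30, 1)).astype("float32")

# Training the larger model also trains the weights inside the submodel
full_model.fit([X_cont, X_cat], y, epochs=2, verbose=0)

# After training, the submodel emits one embedding vector per sample
embeddings = embedding_model.predict(X_cat, verbose=0)
print(embeddings.shape)  # (30, 2)
```

Because the submodel and the larger model share the same layer objects, no weights need to be copied after training.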

Answer 2

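One thing you can do is train your "pretrained" model with a unique name on every layer, and save it.

Then create a new model that uses the same names for the layers you want to keep, and call Model.load_weights(file_path, by_name=True).

This lets you keep all the layers you want and change everything else.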
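As a rough illustration of reusing a layer by name, the sketch below copies weights in memory with get_weights() and set_weights() instead of going through a weights file. This is a swapped-in variant of the idea, not the load_weights call itself, whose accepted file formats depend on the Keras version. The helper `build_model` and the layer sizes are hypothetical.

```python
import numpy as np
import keras
from keras import layers


def build_model(n_hidden):
    """Hypothetical builder: same named embedding layer, different hidden size."""
    cont_in = keras.Input(shape=(2,))
    cat_in = keras.Input(shape=(1,), dtype="int32")
    emb = layers.Embedding(input_dim=4, output_dim=2,
                           name="embedding_layer")(cat_in)
    emb = layers.Reshape((2,))(emb)
    x = layers.concatenate([cont_in, emb])
    x = layers.Dense(n_hidden, activation="relu")(x)
    out = layers.Dense(1)(x)
    return keras.Model([cont_in, cat_in], out)


pretrained = build_model(n_hidden=25)  # stands in for the trained model
new_model = build_model(n_hidden=8)    # everything but the embedding differs

# Copy only the layers we want to keep, matched by their unique names
layers_to_keep = ["embedding_layer"]
for name in layers_to_keep:
    new_model.get_layer(name).set_weights(
        pretrained.get_layer(name).get_weights())

# Optionally freeze the copied layer so later training leaves it unchanged
new_model.get_layer("embedding_layer").trainable = False
```

Matching by name only works if the kept layers have the same names and weight shapes in both models.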

Answer 3

To get the embedding layer output with shape (nr. of samples, chosen output size):

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer("embedding_layer")
                                 .output)
embedding_output = \
    intermediate_layer_model.predict([X_train_continuous,
                                      X_train_categorical['area_mapping']])

print(embedding_output.shape)  # (3000, 1, 2)

intermediate_output = \
    embedding_output.reshape(embedding_output.shape[0], -1)

print(intermediate_output.shape)  # (3000, 2)

3 answers

Answer 1: score 1, accepted, 2021-05-20 06:47:00

Answer 2: score 0, 2021-05-19 20:56:05

Answer 3: score 0, 2021-05-20 11:55:27