How to use a learned embedding layer from a Keras ANN as an input feature in an XGBoost model?

I am attempting to reduce the dimensionality of a categorical feature by extracting an embedding layer from a neural net and using it as an input feature in a separate XGBoost model.

An embedding layer has the dimensions (nr. unique categories + 1, chosen output size). How can it be concatenated to the continuous variables in the original training data, which has the dimensions (nr. observations, nr. features)?

Below is a reproducible example of regression with a neural net, in which a categorical feature is encoded as a learned embedding layer. The example is closely adapted from: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers

At the end I have printed the embedding layer and its shape. How can this layer be merged with the continuous features in the original training data (X_train_continuous)? If the number of rows were equal to the number of categories, and if we knew the order in which categories are represented in the embedding layer, the embedding array could perhaps be joined to the training observations on category. Instead, the number of rows equals the number of categories + 1 (in the code: len(values) + 1).

# Imports and helper functions

import numpy as np
import pandas as pd
import keras
from keras.layers import Input, Embedding, Dense
from keras.models import Model
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once in 'display' epochs
    """

    def __init__(self, display=100):
        # Initialise the base Callback so Keras can attach its own state
        super().__init__()
        self.display = display

    def on_train_begin(self, logs=None):
        self.epochs = 0

    def on_epoch_end(self, epoch, logs=None):
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
                self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example

per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a,s,n).

    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * \
           per_room_additional_price[area]

# Create toy data

AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames

    Note that the np.random.choice call only determines the number of rooms and the size of the house
    (the price, which we calculate later, is deterministic)

    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []

    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms) for area in
                    AREAS]

    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])

# Create the train and validation sets

train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set

train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors

continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization

# Normalizing both train and validation sets to have 0 mean and std. of 1
# using the train set mean and std.
# This will give each feature an equal initial importance and speed up the
# training time

train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std

X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std

# Build a model using a categorical variable
# First let's define a helper class for the categorical variable

class EmbeddingMapping():
    """
    Helper class for handling categorical variables

    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer value
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in
                               enumerate(values)}

        # The num_values will be used as the input_dim when defining the
        # embedding layer: one row per seen value plus row 0, which is
        # reserved for unseen values
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]

        # Else, return the reserved index 0 for unseen values. Returning
        # num_values here would be out of range for an Embedding layer with
        # input_dim=num_values (valid indices are 0..num_values - 1)
        else:
            return 0

# Create an embedding column for the train/validation sets

area_mapping = EmbeddingMapping(X_train_categorical['area'])

X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers

# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide to what vector size we want to map our 'area' variable.
# I'll use 2 here; a small vector is enough because we only have three areas
embeddings_output = 2

# Let’s define the embedding layer and flatten it
area_embedings = Embedding(output_dim=embeddings_output,
                           input_dim=area_mapping.num_values,
                           input_length=1, name="embedding_layer")(area_input)
area_embedings = keras.layers.Reshape((embeddings_output,))(area_embedings)

# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1], ))

# Concatenate continuous and embeddings inputs
all_input = keras.layers.concatenate([continuous_input, area_embedings])

# To merge them together we will use Keras Functional API
# Will define a simple model with 2 hidden layers, with 25 neurons each.

# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note using the input object 'area_input' not 'area_embeddings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Let's train the model

epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note continuous and categorical columns are passed in the same order as
# the inputs were given to Model(inputs=[continuous_input, area_input])
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']],
                    y_train, epochs=epochs, batch_size=128, callbacks=[
        periodic_logger_250], verbose=0,
                    validation_data=([X_val_continuous, X_val_categorical[
                        'area_mapping']], y_val))

# Observe the embedding layer

embeddings_output = model.get_layer('embedding_layer').get_weights()[0]

print(f'Embedding layer:\n{embeddings_output}')
print(f'Embedding layer shape: {embeddings_output.shape}')

First, this post has a terminology problem: an "embedding" is the representation of a particular input sample. It is the vector output by a layer. The "weights" are the matrices stored and trained inside the layer.
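
To make the distinction concrete, here is a minimal NumPy sketch. It assumes the weight matrix has already been pulled out of the layer (as the question does with get_weights()[0]); the numbers and the `codes` array are made up for illustration. The embedding of each observation is simply the row of the weight matrix selected by that observation's integer code, so the result has one row per observation rather than one row per category.

```python
import numpy as np

# Stand-in for model.get_layer('embedding_layer').get_weights()[0]:
# shape (nr. unique categories + 1, chosen output size).
# Row 0 is not used by the 1-based mapping of the training categories.
weights = np.array([[0.0, 0.0],
                    [0.1, 0.2],   # code 1
                    [0.3, 0.4],   # code 2
                    [0.5, 0.6]])  # code 3

# Hypothetical integer codes for five observations
# (what the 'area_mapping' column holds in the question)
codes = np.array([1, 3, 2, 1, 3])

# The embedding of each observation is a row lookup into the weights
embeddings = weights[codes]

print(weights.shape)     # (4, 2)
print(embeddings.shape)  # (5, 2)
```

The lookup uses the same integer codes that were fed to the network, so no separate knowledge of the category order is needed.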

In Keras, the Model class is a subclass of Layer. You can use any Model as a Layer in a larger model.

You can create a Model with just the Embedding layer, then use it as a layer when building the rest of your model. After training, you can call .predict() on that "sub-model". Also, you can save that sub-model out to a JSON file and reload it later.

This is the standard technique for creating a model that emits internal embeddings.
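
A rough sketch of that sub-model idea follows. The names `embedder`, `cont_in` and `code_in`, the layer sizes and the random toy data are all assumptions made for illustration, not taken from the question. The embedding is wrapped in its own Model, reused as a layer inside the full regression model, and then called on its own after training.

```python
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from keras.models import Model

n_categories, emb_dim = 3, 2

# Sub-model: integer category code -> flattened embedding vector
cat_in = Input(shape=(1,), dtype="int32")
emb = Embedding(input_dim=n_categories + 1, output_dim=emb_dim)(cat_in)
emb = Flatten()(emb)
embedder = Model(inputs=cat_in, outputs=emb, name="embedder")

# Full model: the sub-model is used like any other layer
cont_in = Input(shape=(2,))
code_in = Input(shape=(1,), dtype="int32")
merged = Concatenate()([cont_in, embedder(code_in)])
hidden = Dense(8, activation="relu")(merged)
model = Model(inputs=[cont_in, code_in], outputs=Dense(1)(hidden))
model.compile(loss="mse", optimizer="adam")

# Random toy data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(30, 2)).astype("float32")
codes = rng.integers(1, n_categories + 1, size=(30, 1)).astype("int32")
y = rng.normal(size=(30, 1)).astype("float32")
model.fit([X_cont, codes], y, epochs=1, verbose=0)

# After training, the sub-model alone maps codes to embeddings
embedded = embedder.predict(codes, verbose=0)
print(embedded.shape)  # (30, 2)
```

Because the sub-model shares its weights with the full model, whatever the full model learned during fit() is what embedder.predict() returns.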

One thing you can do is to run your 'pretrained' model with each layer having a unique name and save it.

Then, create your new model, giving the layers you want to keep the same names, and use Model.load_weights(file_path, by_name=True).

This will let you keep all of the layers that you want and change everything after them.
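
As a hedged illustration of the same name-matching idea, the sketch below copies the weights in memory with get_layer(...).get_weights() and set_weights(...) instead of going through a saved file, so it is a variant of the approach above rather than the load_weights call itself. The `build` helper and the layer sizes are hypothetical.

```python
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model


def build(head_units):
    # Hypothetical helper: same named embedding layer, different head size
    inp = Input(shape=(1,), dtype="int32")
    x = Embedding(input_dim=4, output_dim=2, name="embedding_layer")(inp)
    x = Flatten()(x)
    x = Dense(head_units, activation="relu")(x)
    return Model(inputs=inp, outputs=Dense(1)(x))


pretrained = build(head_units=25)  # stands in for the already trained model
new_model = build(head_units=10)   # new architecture after the embedding

# Copy the weights of the layer that has the matching name
src = pretrained.get_layer("embedding_layer")
dst = new_model.get_layer("embedding_layer")
dst.set_weights(src.get_weights())

# Optional: keep the transferred embedding fixed while the new head trains
dst.trainable = False
```

Only the layer named "embedding_layer" is transferred; the rest of new_model keeps its fresh initialisation.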

To get the embedding layer outputs with shape (nr. samples, chosen output size):

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer("embedding_layer")
                                 .output)
embedding_output = \
    intermediate_layer_model.predict([X_train_continuous,
                                      X_train_categorical['area_mapping']])

print(embedding_output.shape)  # (3000, 1, 2)

intermediate_output = \
    embedding_output.reshape(embedding_output.shape[0], -1)

print(intermediate_output.shape)  # (3000, 2)
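
From there, joining the embeddings to the continuous features is a column-wise concatenation. The sketch below uses random stand-in arrays in place of the real embedding_output and X_train_continuous (an assumption, so that it runs on its own). The resulting 2-D matrix is the kind of input a tree model such as XGBoost's XGBRegressor could be fitted on, although that step is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, emb_dim = 6, 2

# Stand-ins for the real arrays:
#   embedding_output -> output of intermediate_layer_model.predict(...)
#   continuous       -> X_train_continuous.to_numpy()
embedding_output = rng.normal(size=(n_samples, 1, emb_dim))
continuous = rng.normal(size=(n_samples, 2))

# Drop the length-1 axis, then stack the columns side by side
flat = embedding_output.reshape(n_samples, -1)
features = np.hstack([continuous, flat])

print(features.shape)  # (6, 4)
```

Row i of the embedding output belongs to row i of the training data, because both come from the same ordered input, so no join on the category is needed.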
