
Training set contains “labels” as inputs to keras model

I'm seeing that my keras model does not handle input columns well if they are not float values. I'd like to be able to train the model using columns that contain "labels", by which I mean IDs of sorts, or encoded string names. Ideally, it would integrate these label columns into its model and determine which values within these categorical columns are predictive of the outcome.

For example, I'm trying to predict the outcomes of a competition (Win=1, Loss=0), and I'd like to include "team name" and "coach name" in the historical data. Ideally the model would identify which teams and coaches are more likely to win.

However, when I run model.fit and the training_set includes anything other than int/float values (i.e., values that are statistical in nature rather than categorical), it produces the same accuracy for every epoch together with a very high loss.

Here is how I defined my model:

import tensorflow as tf
from tensorflow import keras

# init_ru was not defined in the posted snippet; a RandomUniform initializer is assumed here
init_ru = keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)

model = keras.Sequential([
        keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_ru, bias_initializer=init_ru),
        keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_ru, bias_initializer=init_ru),
        keras.layers.Dense(256, activation=tf.nn.relu),
        keras.layers.Dense(128, activation=tf.nn.relu),
        keras.layers.Dense(32, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])

opt = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=True)

model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])

It works great if I don't include any categorical data, but I think that if I could get it to work with categorical data, it would improve even more.

The standard way to handle categorical data is to create a dictionary of valid values and then convert each category into a one-hot vector.
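A minimal sketch of that idea in plain numpy; the team names and the vocab/indices variables below are made up for illustration:

import numpy as np

# Build a dictionary (vocabulary) of valid values and map each category to an integer index
teams = ["Tigers", "Bears", "Lions", "Tigers"]                      # hypothetical raw column
vocab = {name: idx for idx, name in enumerate(sorted(set(teams)))}  # {"Bears": 0, "Lions": 1, "Tigers": 2}
indices = np.array([vocab[name] for name in teams])

# Convert the integer indices into one-hot vectors
one_hot = np.eye(len(vocab))[indices]
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]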

This is a reasonable introductory article with examples: https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Supposing your independent variables (features) are in a dataframe df, you can use:

pd.get_dummies(df.iloc[:,columns_to_be_converted])

An example with a numpy array:

pd.get_dummies(np.array(["Mark","Sarah","Mark","John"]).astype(str))

Output:

   John  Mark  Sarah
0     0     1      0
1     0     0      1
2     0     1      0
3     1     0      0
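To tie this back to the question, here is a rough sketch of how the one-hot columns could be combined with numeric features before calling model.fit. The DataFrame and its column names (team_name, coach_name, points_avg, won) are hypothetical, and it assumes the model and imports defined above:

import numpy as np
import pandas as pd

# Hypothetical historical data: two categorical columns plus one numeric column and the label
df = pd.DataFrame({
    "team_name":  ["Tigers", "Bears", "Tigers", "Lions"],
    "coach_name": ["Smith", "Jones", "Smith", "Brown"],
    "points_avg": [78.2, 65.4, 80.1, 70.3],
    "won":        [1, 0, 1, 0],
})

# One-hot encode the categorical columns and keep the numeric ones as-is
features = pd.concat(
    [pd.get_dummies(df[["team_name", "coach_name"]]), df[["points_avg"]]],
    axis=1,
).astype("float32")
labels = df["won"].values

# The resulting all-float matrix can be fed to the model defined in the question
model.fit(features.values, labels, epochs=10, batch_size=2)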
