
Dealing with multiple categorical inputs and variable-sized groups as inputs to a neural network

Original post: 2019-07-30 19:55:38 · Tags: python, machine-learning, keras, neural-network, embedding

I'm working with data that consists of numerical and categorical features, where each input is a variable-sized group of these features. For example: predict the price of a house using features of each room in the house, where each house can have a different number of rooms. The features could be size in meters, type (e.g. living room/bathroom/bedroom), color, floor... Some of the categorical features have high cardinality, and I may be using many features. I want to use the features from n rooms to predict the price of each house. How should I structure my inputs/NN model to receive variable-sized groups of inputs?
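One common way to feed variable-sized groups to a network is to pad every group to the same length and carry a mask that marks which slots are real. The sketch below only illustrates the data layout; the toy `houses` data, the two numeric features per room, and the helper name `pad_houses` are assumptions made up for the example.

```python
import numpy as np

# Hypothetical toy data: each house is a list of rooms,
# each room is (size_m2, floor). Values are illustrative only.
houses = [
    [(20.0, 0), (12.5, 1)],             # house with 2 rooms
    [(30.0, 0), (10.0, 0), (15.0, 1)],  # house with 3 rooms
]

def pad_houses(houses, n_features):
    """Pad every house to the same number of rooms and return a boolean mask."""
    max_rooms = max(len(rooms) for rooms in houses)
    x = np.zeros((len(houses), max_rooms, n_features), dtype=np.float32)
    mask = np.zeros((len(houses), max_rooms), dtype=bool)
    for i, rooms in enumerate(houses):
        x[i, :len(rooms)] = rooms      # real rooms fill the first slots
        mask[i, :len(rooms)] = True    # padded slots stay False
    return x, mask

x, mask = pad_houses(houses, n_features=2)
print(x.shape)  # (houses, max_rooms, features)
print(mask)
```

The padded array has a fixed shape, so it can be batched, while the mask lets later layers ignore the slots that do not correspond to a real room.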

I thought of using one-hot encoding, but then I'd end up with large input vectors and I'd lose the connections between the features for each room. I also thought of using embeddings, but I'm not sure of the best way to combine the features/samples so that all the data is input properly without losing any information about which features come from which samples, etc.
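To make the embedding idea concrete, here is a minimal sketch of one possible arrangement: each room's categorical feature is looked up in an embedding table and concatenated with that room's numeric features, so the features of one room stay together, and the per-room vectors are then averaged into one fixed-size vector per house. The vocabulary, the embedding size, the random table (it would be a trainable parameter in a real model, such as a Keras `Embedding` layer) and the choice of mean pooling are all assumptions for illustration, not something the original post specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary for one categorical feature.
room_types = {"living room": 0, "bathroom": 1, "bedroom": 2}
embed_dim = 4
# Random stand-in for a trainable embedding table, used only to show shapes.
type_table = rng.normal(size=(len(room_types), embed_dim))

def encode_rooms(rooms):
    """rooms: list of (size_m2, room_type) -> array of shape (n_rooms, 1 + embed_dim)."""
    return np.stack([
        np.concatenate(([size], type_table[room_types[rtype]]))
        for size, rtype in rooms
    ])

def pool_house(rooms):
    """Average the per-room vectors into one fixed-size vector per house."""
    return encode_rooms(rooms).mean(axis=0)

house_a = [(20.0, "living room"), (6.0, "bathroom")]
house_b = [(25.0, "living room"), (6.0, "bathroom"), (12.0, "bedroom")]
pooled = [pool_house(h) for h in (house_a, house_b)]
print([p.shape for p in pooled])  # same shape regardless of room count
```

Mean pooling ignores the order of the rooms and discards the room count; sum or max pooling, or a recurrent or attention layer over the rooms, are alternatives with different trade-offs.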

1 Answer

As the article linked below suggests, you have three routes to choose from.

Ordinal Encoding, which I don't think is the right fit for your example.
One-Hot Encoding, which you've effectively ruled out.
Difference Encoding, which I think is somewhat suited, as there are master bedrooms, minor ones, guest ones and children's ones. So, try that angle.
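
The three options above can be compared on a small example. This is a rough sketch that assumes "Difference Encoding" means backward difference contrast coding, and that the bedroom kinds have a meaningful order; both the ordering in `levels` and the function names are hypothetical choices made for the illustration.

```python
import numpy as np

# Hypothetical ordered levels for a "bedroom kind" feature; the ordering is an assumption.
levels = ["children", "guest", "minor", "master"]

def ordinal_encode(value):
    """Map each level to its integer position."""
    return levels.index(value)

def one_hot_encode(value):
    """One column per level, with a single 1."""
    vec = np.zeros(len(levels))
    vec[levels.index(value)] = 1.0
    return vec

def backward_difference_matrix(k):
    """k x (k-1) contrast matrix; column j contrasts level j+1 with level j."""
    m = np.zeros((k, k - 1))
    for j in range(1, k):
        m[:j, j - 1] = -(k - j) / k  # levels up to j
        m[j:, j - 1] = j / k         # levels after j
    return m

contrast = backward_difference_matrix(len(levels))

def difference_encode(value):
    """Look up the contrast row for a level (k - 1 columns instead of k)."""
    return contrast[levels.index(value)]

for name in levels:
    print(name, ordinal_encode(name), one_hot_encode(name), difference_encode(name))
```

Difference coding uses one column fewer than one-hot encoding, but like ordinal encoding it only makes sense if the levels really are ordered.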