
How to use custom features in Keras for text classification

I'm working on a text classifier in Python using Keras. So far I have built the model using only the words of my dataset, as a bag of words. Now I would like to use other custom features (such as polarity) in my classifier, but I don't know how to add them to my code. My dataset looks like this:

 Text                    | Polarity | Number of words | Classification
 ------------------------+----------+-----------------+----------------
 Hello my name is John   |    0,05  |        5        |        0
 How old are you?        |    0,00  |        4        |        1
 I'm very hungry         |   -0,05  |        4        |        0

The middle two columns are the custom features that I want to add to my classifier, but I don't know how.

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

# tokenizer, allWordIndices, max_words and the raw labels in train_y
# are assumed to be defined earlier in the script.

# Binary bag-of-words matrix: one row per text, one column per vocabulary word.
all_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
# One-hot encode the two classes.
all_y = keras.utils.to_categorical(train_y, 2)

# Hold out the first 1000 samples for testing and train on the rest.
test_x, train_x = all_x[:1000], all_x[1000:]
test_y, train_y = all_y[:1000], all_y[1000:]

model = Sequential()
model.add(Dense(30, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.45))
model.add(Dense(100, activation='softplus'))
model.add(Dropout(0.45))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

history = model.fit(train_x, train_y,
                    batch_size=32, epochs=10, verbose=1,
                    validation_split=0.1, shuffle=True)

score = model.evaluate(test_x, test_y, batch_size=128)

In this example I use only the bag-of-words features from the content of the first column, and I want to add the other two columns (polarity, number of words) as features. Does anyone have an idea how to add these? Thanks in advance.

For bag of words, you can simply concatenate your numerical features onto the end of your BoW vector. You can do this with numpy or, even more easily, with pandas. You then have a vector of dimension max_words + custom_numerical_features.
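The concatenation described above can be sketched with numpy as follows. This is only a minimal illustration: the tiny `bow` matrix and the `max_words = 6` value are made-up stand-ins for the real tokenizer output, and the polarity and word-count values are taken from the example table in the question.

```python
import numpy as np

# Toy stand-in for the tokenizer output (assumed shape: samples x max_words).
max_words = 6
bow = np.array([
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
], dtype="float32")

# Custom features from the example table: polarity and number of words.
polarity = np.array([0.05, 0.00, -0.05], dtype="float32")
num_words = np.array([5, 4, 4], dtype="float32")

# Arrange the custom features as columns, then append them to the BoW matrix.
custom = np.column_stack([polarity, num_words])
features = np.hstack([bow, custom])

print(features.shape)  # (3, 8): max_words + 2 custom features
```

With this layout, the first Dense layer of the model would need `input_shape=(max_words + 2,)` instead of `(max_words,)`. Rescaling features such as the word count to a range similar to the binary BoW entries may also help training, although that depends on the data.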

That said, I did something similar and worked extensively with several approaches, such as BoW and embeddings.

It is a good idea to separate text features and numerical features in your network. To do so, you can use a multiple-input model. I just wrote a blog post about it. It uses embeddings, but in general the approach also works for BoW.
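A multiple-input model of this kind might look roughly like the sketch below, written with the Keras functional API. It is not the code from the blog post: the layer sizes of the numeric branch, the `max_words = 1000` value, and the random placeholder data are all assumptions chosen only so that the example runs on its own.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout, concatenate

max_words = 1000   # assumed vocabulary size
num_custom = 2     # polarity and number of words

# Branch 1: the bag-of-words vector.
text_in = Input(shape=(max_words,), name="bow_input")
x = Dense(30, activation="relu")(text_in)
x = Dropout(0.45)(x)

# Branch 2: the custom numerical features (layer size is an assumption).
num_in = Input(shape=(num_custom,), name="custom_input")
y = Dense(8, activation="relu")(num_in)

# Merge both branches and classify.
merged = concatenate([x, y])
z = Dense(100, activation="softplus")(merged)
z = Dropout(0.45)(z)
out = Dense(2, activation="softmax")(z)

model = Model(inputs=[text_in, num_in], outputs=out)
model.compile(loss="categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

# Random placeholder data, used only to show the call signatures.
rng = np.random.default_rng(0)
bow = rng.integers(0, 2, size=(64, max_words)).astype("float32")
custom = rng.normal(size=(64, num_custom)).astype("float32")
labels = np.eye(2, dtype="float32")[rng.integers(0, 2, size=64)]

# Each input is passed as its own array, in the order given to Model(...).
model.fit([bow, custom], labels, batch_size=32, epochs=1, verbose=0)
probs = model.predict([bow, custom], verbose=0)
```

Keeping the two branches separate lets each feature type get its own layers before merging, so a small number of numerical features is less likely to be swamped by the much wider BoW vector.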

