
Keras always predicting the same output

Keras always predicts the same class for every input I give it. There are currently four classes: News, Weather, Sport and Economy.

The training set consists of many different texts, each labeled with the class that matches its topic. There are far more texts classified as News and Sport than as Weather and Economy.

  • News: 12112 texts
  • Weather: 1685 texts
  • Sport: 13669 texts
  • Economy: 1282 texts

I would have expected the model to be biased towards Sport and News, but instead it is completely biased towards Weather: every input is classified as Weather with at least 80% confidence.
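To put that expectation in numbers, here is a minimal sketch that computes each class's share of the data and the accuracy of a trivial model that always predicts the most frequent class. The counts are the ones listed above; the variable names are only illustrative.

```python
# Class counts as listed in the question.
counts = {"News": 12112, "Weather": 1685, "Sport": 13669, "Economy": 1282}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
majority_baseline = shares[majority_class]

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"majority baseline ({majority_class}): {majority_baseline:.1%}")
```

With these counts, a model that always answers Weather would be right on only about 6% of a representative sample, so very high training accuracy combined with all-Weather predictions suggests that the labels the model actually sees may not be the ones listed here.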

Just to add to my confusion: during training, the classifier reaches accuracy scores of 95% to 100% (sic!). I guess I am doing something really stupid here, but I don't know what it is.

This is how I call my classifier. It runs on Python 3 on a Windows PC.

with open('model.json') as json_data:
    model_JSON = json.load(json_data)

model_JSON = json.dumps(model_JSON) 
model = model_from_json(model_JSON)

model.load_weights('weights.h5')

text = str(text.decode())   
encoded = one_hot(text, max_words, split=" ")

tokenizer = Tokenizer(num_words=max_words)
matrix = tokenizer.sequences_to_matrix([encoded], mode='binary')

result = model.predict(matrix)

legende = ["News", "Wetter", "Sport", "Wirtschaft"]
print(str(legende))
print(str(result))

cat = numpy.argmax(result)  
return str(legende[cat]).encode()

This is how I train my classifier. I omitted the part where I fetch the data from a database. This is done on a Linux VM. I have already tried changing the loss and activation functions, but nothing changed. I am also currently trying more epochs, but so far that hasn't helped either.

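import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer, one_hot

max_words = 10000
batch_size = 32
epochs = 15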

rows = cursor.fetchall()

X = []
Y = []

# Read in the rows
for row in rows:
    X.append(row[5])
    Y.append(row[1])

num_classes = len(set(Y))
Y = one_hot("$".join(Y), num_classes, split="$")


for i in range(len(X)):
    X[i] = one_hot(str(X[i]), max_words, split=" ")

split = round(len(X) * 0.2)     

x_test = np.asarray(X[0:int(split)])
y_test = np.asarray(Y[0:int(split)])

x_train = np.asarray(X[int(split):len(X)])
y_train = np.asarray(Y[int(split):len(X)])

print('x_test shape', x_test.shape)
print('y_test shape', y_test.shape)

print(num_classes, 'classes')

# vectorize
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

# convert class vector to binary class matrix
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# create the model
model = Sequential()

model.add(Dense(512, input_shape=(max_words,)))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])


history = model.fit(x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_split=0.1
    )

score = model.evaluate(x_test, y_test,
    batch_size=batch_size, 
    verbose=1
    )

print('Test score', score[0])
print('Test accuracy', score[1])

#write model to json
print("writing model to json")
model_json = model.to_json()
with open("model.json", 'w') as json_file:
    json_file.write(model_json)

#save weights as hdf5
print("saving weights to hdf5")
model.save_weights("weights.h5")

Thanks to the tip that @Daniel Möller gave me, I found out what the problem was. His tip was to look at how many instances of each class are contained in the training set.
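A check like that can be sketched with the standard library alone. In the snippet below, `class_distribution` is a hypothetical helper and `encoded_labels` is made-up data standing in for the encoded `Y` from the training script; it is not the original data.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, fraction)} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Made-up encoded labels standing in for Y after the encoding step.
encoded_labels = [1, 1, 1, 3, 1, 1, 2, 1]

for label, (n, frac) in sorted(class_distribution(encoded_labels).items()):
    print(f"class {label}: {n} samples ({frac:.1%})")
```

Running the same count on the raw class names and again on the encoded labels makes it easy to see whether the encoding step has merged or skewed the classes.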

In my case I found out that hashing the classes with one_hot is not smart, as it will sometimes encode multiple classes with the same number. For me, one_hot encoded nearly everything as 1, so Keras learned to predict only 1.
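One way to avoid such collisions is to give each class name its own index explicitly instead of hashing it. The sketch below uses only the standard library; `build_label_index` and `encode_labels` are hypothetical helper names, and the sample labels are made up. As far as I understand the Keras documentation, `one_hot` does not guarantee unique indices and reserves index 0, so passing the number of classes as `n` leaves fewer distinct values than there are classes.

```python
def build_label_index(labels):
    """Map each distinct class name to a unique integer, in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


def encode_labels(labels, label_index):
    """Replace each class name by its integer index."""
    return [label_index[name] for name in labels]


# Made-up labels standing in for the Y column read from the database.
Y = ["Sport", "News", "Weather", "Sport", "Economy", "News"]

label_index = build_label_index(Y)
y_encoded = encode_labels(Y, label_index)

# Every class keeps its own number, so no two classes can collide.
assert len(set(label_index.values())) == len(label_index)
print(label_index)
print(y_encoded)
```

The resulting integers can be passed to `keras.utils.to_categorical(y_encoded, num_classes)` as in the training script. The same `label_index` would need to be reused at prediction time so that the order of the `legende` list matches the model's output columns.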
