Error when checking input: expected dense_1_input to have shape (3773,) but got array with shape (111,)

Question

I want to classify emails with Keras. I already have folders that contain emails, so I want Keras to train a model that predicts where to put the unclassified emails, based on what I have already classified.

So I read all the mails and created a DataFrame with two columns using pandas: one is the list of all the words in the mail, and the other is the folder the mail belongs to.

After that I created x_train, y_train, x_test and y_test to train and evaluate my code. That gave me good results, so I wanted to classify the unclassified emails the same way: read the mail, tokenize it, use pd.get_dummies, and then convert the result to a NumPy array.

I do the conversion because it looks like the predict call can only handle a NumPy array or a list of NumPy arrays.

And here is the issue: the matrices are different, because the number of words in the unclassified mail differs from the number of words in my dataset. That leads to different shapes and therefore an error, and I would like to know how to solve it.
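To make the mismatch concrete, here is a minimal pure-Python sketch of what one-hot encoding a list of words does. The helper `one_hot_rows` and the sample word lists are made up for illustration; it only imitates the idea behind pd.get_dummies (one row per word, one column per distinct word) and is not the pandas implementation. Under that assumption, the 111 in the error would be the number of distinct words in the new mail, while 3773 is the width the model was trained on.

```python
def one_hot_rows(words):
    """One row per word, one column per distinct word (hypothetical helper)."""
    columns = sorted(set(words))
    index = {word: i for i, word in enumerate(columns)}
    rows = []
    for word in words:
        row = [0] * len(columns)
        row[index[word]] = 1
        rows.append(row)
    return columns, rows


# Made-up data: words seen during training vs. words of one new mail.
training_words = "meeting invoice payment meeting project deadline invoice".split()
new_mail_words = "project meeting tomorrow".split()

train_cols, train_rows = one_hot_rows(training_words)
mail_cols, mail_rows = one_hot_rows(new_mail_words)

# The widths differ because each encoding builds its own set of columns.
print(len(train_cols), len(mail_cols))
```

Note also that each row here stands for a single word, not for a whole mail, so even with matching widths the model would be asked to classify every word separately.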

I tried to use OneHotEncoder, but it failed. I don't know whether that is because of the way I used it.

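import numpy as np
import pandas as pd
import unidecode
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.preprocessing.text import text_to_word_sequence

# lst: each row contains all the words of one mail
# lst2: each row contains the path to the folder that mail belongs to

data = pd.DataFrame(list(zip(lst, lst2)), columns=['text', 'folder'])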

train_size = int(len(data) * .8)
train_posts = data['text'][:train_size]
train_tags = data['folder'][:train_size]

test_posts = data['text'][train_size:]
test_tags = data['folder'][train_size:]

model = Sequential()
model.add(Dense(16, input_shape=(vocab_size,)))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(16))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(num_labels))
model.add(Activation('sigmoid'))
model.summary()

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=100, verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)

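# Read the unclassified mail (read_files is my own helper, described below)

sentences = read_files("mail.eml")
sentences = ' '.join(sentences)
sentences = sentences.lower()
# Strip accents, then split the text into a list of words
salut = unidecode.unidecode(sentences)
salut = text_to_word_sequence(salut)
# One-hot encode the words of this mail only, then convert to a NumPy array
salut = np.array(pd.get_dummies(salut).values)

# This call raises the ValueError shown below
pred = model.predict_classes(salut, batch_size=batch_size, verbose=1)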

Results of the training:

3018/3018 [==============================] - 0s 64us/step - loss: 0.0215 - acc: 0.9949 - val_loss: 0.0217 - val_acc: 0.9950

ValueError: Error when checking input: expected dense_1_input to have shape (3773,) but got array with shape (111,)

I use 3773 words in total, which I split into x_train and x_test: the training length is 80% of 3773, so 3018, and the rest (755) goes to the test set.
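As a quick check of those numbers, the split can be recomputed with the same expression the code above uses for train_size. This is only the arithmetic, under the assumption that the total really is 3773 rows.

```python
total = 3773

# Same expression as in the training script: 80% of the rows, truncated.
train_size = int(total * .8)
test_size = total - train_size

print(train_size, test_size)
```

These match the 3018/3018 and 755/755 counters in the fit and evaluate logs.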

training_time (fit)

3018/3018 [==============================] - 0s 67us/step - loss: 0.0225 - acc: 0.9950 - val_loss: 0.0221 - val_acc: 0.9950

test_time (evaluate)

755/755 [==============================] - 0s 25us/step

Result of evaluate

Test score: 0.022089334732748024 Test accuracy: 0.9950132541309129

I forgot to say that read_files is just a function I wrote that reads the file and returns a list of all the words in the mail.

I tried completing the matrix of length 111 by adding as many all-zero columns as needed to reach the 3773 length. This does run, but the matrix is certainly wrong, and it gives me very poor results even though I have a high "accuracy" and "val_accuracy".
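A likely reason the zero-padded matrix gives poor predictions is that padding fixes the width but not the meaning of each column. The tiny sketch below uses made-up vocabularies (`train_vocab`, `mail_vocab`) to show a word landing in a column that the trained model associates with a different word; it is an illustration of the idea, not a reproduction of the actual data.

```python
# Column order the model was trained on (hypothetical example).
train_vocab = ["deadline", "invoice", "meeting", "payment", "project"]

# Columns produced by encoding the new mail on its own.
mail_vocab = ["meeting", "project"]

# One-hot row for the word "meeting" in the mail's own 2-column encoding.
mail_row = [1, 0]

# Zero-padding to the training width keeps the 1 in column 0.
padded = mail_row + [0] * (len(train_vocab) - len(mail_row))

# Column 0 means "deadline" to the trained model, not "meeting".
word_seen_by_model = train_vocab[padded.index(1)]

print(padded, word_seen_by_model)
```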

If you know how to solve this, please share any ideas you have.

Answer 1

I solved the problem. The length mismatch between the two matrices was caused by not using the same dictionary when I tokenized my unclassified mail and the other mails.

So if anyone encounters this problem: you need to use the same tokenizer throughout the whole program.
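Here is a minimal sketch of that idea in plain Python: build the word-to-column dictionary once from the training mails, then reuse it for every new mail. The class `BagOfWordsEncoder` and the sample mails are hypothetical and only stand in for whatever tokenizer is actually used. With Keras, the analogous approach would be to fit a single `keras.preprocessing.text.Tokenizer` with `fit_on_texts` on the training mails and call `texts_to_matrix` on both the training mails and the new ones.

```python
class BagOfWordsEncoder:
    """Maps each mail to one fixed-width 0/1 row (hypothetical helper)."""

    def __init__(self):
        self.index = {}  # word -> column, built once from the training mails

    def fit(self, documents):
        for words in documents:
            for word in words:
                self.index.setdefault(word, len(self.index))
        return self

    def transform(self, documents):
        matrix = []
        for words in documents:
            row = [0] * len(self.index)
            for word in words:
                column = self.index.get(word)
                if column is not None:  # words unseen in training are skipped
                    row[column] = 1
            matrix.append(row)
        return matrix


# Made-up training mails, each already split into words.
train_mails = [
    ["invoice", "payment", "due"],
    ["project", "meeting", "deadline"],
]
new_mail = ["meeting", "about", "payment"]

encoder = BagOfWordsEncoder().fit(train_mails)
x_train = encoder.transform(train_mails)
x_new = encoder.transform([new_mail])  # same dictionary, so same width

print(len(x_train[0]), len(x_new[0]))
print(x_new)
```

Each mail becomes exactly one row whose width equals the training vocabulary size, which is the shape the first Dense layer expects.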

Accepted 2019-08-07 06:09:14
