
Error in prediction script using CNN model for text classification

I am trying to write the prediction part of a script for this tutorial: https://mxnet.incubator.apache.org/tutorials/nlp/cnn.html

import itertools
import os
import re
from collections import Counter, namedtuple

import numpy as np
import mxnet as mx

SENTENCES_DIR = 'C:/code/mxnet/sentences'
CURRENT_DIR = 'C:/code/mxnet'

def clean_str(string):
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def load_data_sentences(filename):
    # Read and tokenize; str.decode works here because this is Python 2
    with open(filename, "r") as sentences_file:
        x_text = [line.decode('Latin1').strip() for line in sentences_file]
    x_text = [clean_str(sent).split(" ") for sent in x_text]
    return x_text


def pad_sentences(sentences, padding_word=""):
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences


def build_vocab(sentences):
    word_counts = Counter(itertools.chain(*sentences))
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return vocabulary, vocabulary_inv

def build_input_data(sentences, vocabulary):
    x = np.array([
            [vocabulary[word] for word in sentence]
            for sentence in sentences])
    return x

def predict(mod, sen):
    mod.forward(Batch(data=[mx.nd.array(sen)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]    
    for i in a[0:5]:
        print('probability=%f' %(prob[i]))   


sentences = load_data_sentences( os.path.join( SENTENCES_DIR, 'test-pos-1.txt') )
sentences_padded = pad_sentences(sentences)
vocabulary, vocabulary_inv = build_vocab(sentences_padded)
x = build_input_data(sentences_padded, vocabulary)


Batch = namedtuple('Batch', ['data'])

sym, arg_params, aux_params = mx.model.load_checkpoint( os.path.join( CURRENT_DIR, 'cnn'), 19)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names = None)
mod.bind(for_training=False, data_shapes=[('data', (50,56))], label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

predict(mod, x)

But I got the error:

infer_shape error. Arguments: data: (50, 26L)
Traceback (most recent call last):
  File "C:\code\mxnet\test2.py", line 152, in <module>
    predict(mod, x)
  File "C:\code\mxnet\test2.py", line 123, in predict
    mod.forward(Batch(data=[mx.nd.array(sen)]))
...

MXNetError: Error in operator reshape0: [16:20:21] c:\projects\mxnet-distro-win\mxnet-build\src\operator\tensor./matrix_op-inl.h:187: Check failed: oshape.Size() == dshape.Size() (840000 vs. 390000) Target shape size is different to source. Target: [50,1,56,300] Source: [50,26,300]

The source is a text file with 50 sentences.

Unfortunately I didn't find any help on the Internet. Please take a look. OS: Windows 10, Python 2.7. Thank you.

I believe the error you're having is because the padding of your input sentences is different from what the model expects. pad_sentences pads each batch to the length of the longest sentence passed in, so if you're using a different data set you'll almost certainly get a different padding than your model's padding (which is 56). In this case, it looks like you're getting a padding of 26 (from the error message 'Source: [50, 26, 300]').
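The numbers in the check failure line up with exactly that mismatch. As a quick sanity check (just arithmetic on the shapes quoted in the error message, nothing read from the model):

import numpy as np

target = np.prod([50, 1, 56, 300])  # 840000 elements: the reshape target for 56-token sentences
source = np.prod([50, 26, 300])     # 390000 elements: the batch padded to only 26 tokens
print(target, source)               # 840000 390000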

I was able to get your code to run successfully by modifying pad_sentences as follows and calling it with sequence_length=56 to match the model.

def pad_sentences(sentences, sequence_length, padding_word=""):
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(new_sentence)
    return padded_sentences
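
A minimal usage sketch under that change, reusing the helpers from your script and hard-coding the model's length of 56:

sentences = load_data_sentences(os.path.join(SENTENCES_DIR, 'test-pos-1.txt'))
sentences_padded = pad_sentences(sentences, sequence_length=56)  # match the model
vocabulary, vocabulary_inv = build_vocab(sentences_padded)
x = build_input_data(sentences_padded, vocabulary)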

NB: when you do get a successful run, you'll encounter an error inside predict, because prob[i] is not a float.

def predict(mod, sen):
    mod.forward(Batch(data=[mx.nd.array(sen)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    for i in a[0:5]:
        print('probability=%f' % (prob[i]))  # <-- prob[i] is a numpy.ndarray, not a float
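
One way to fix that (a sketch, assuming the squeezed output has one probability row per sentence, i.e. shape (batch, num_classes)) is to report the argmax per sentence instead of indexing with whole rows:

def predict(mod, sen):
    mod.forward(Batch(data=[mx.nd.array(sen)]))
    prob = mod.get_outputs()[0].asnumpy()  # assumed shape: (batch, num_classes)
    for row in prob[:5]:                   # report the first five sentences
        best = int(np.argmax(row))
        print('class=%d probability=%f' % (best, row[best]))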

Vishaal
