How to use gensim with pytorch to create an intent classifier (With LSTM NN)?

The problem to solve: Given a sentence, return the intent behind it (Think chatbot)

Reduced example dataset (intent on the left of the dict):

data_raw    = {"mk_reservation" : ["i want to make a reservation",
                                   "book a table for me"],
               "show_menu"      : ["what's the daily menu",
                                   "do you serve pizza"],
               "payment_method" : ["how can i pay",
                                   "can i use cash"],
               "schedule_info"  : ["when do you open",
                                   "at what time do you close"]}

I have stripped down the sentences with spaCy, and vectorized each word using the word2vec model provided through the gensim library.

This is what resulted from the use of the word2vec model GoogleNews-vectors-negative300.bin:

[[[ 5.99331968e-02  6.50703311e-02  5.03010787e-02 ... -8.00536275e-02
    1.94782894e-02 -1.83010306e-02]
  [-2.14406010e-02 -1.00447744e-01  6.13847338e-02 ... -6.72588721e-02
    3.03986594e-02 -4.14126664e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ...  3.84555124e-02
   -2.29179319e-02 -2.05144612e-03]
  [-5.39291985e-02 -9.88398306e-03  4.39085700e-02 ... -3.55276838e-02
   -3.66208404e-02 -4.57760505e-03]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]
  • This is a list of sentences, and each sentence is a list of words ( [sentences[sentence[word]]] )
  • Each sentence (list) must be of size 10 words (I am padding the remaining with zeroes; see the sketch after this list)
  • Each word (list) has 300 elements (word2vec dimensions)
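Roughly, the padded array can be built like this (a sketch, assuming the GoogleNews vectors are already loaded into a gensim KeyedVectors object named kv, and that out-of-vocabulary words are simply skipped):

import numpy as np

MAX_LEN, DIM = 10, 300

def sentence_matrix(sentence, kv):
    # stack the word2vec vectors of the words and zero-pad up to MAX_LEN rows
    mat = np.zeros((MAX_LEN, DIM), dtype=np.float32)
    words = [w for w in sentence.split() if w in kv][:MAX_LEN]
    for i, w in enumerate(words):
        mat[i] = kv[w]
    return mat

X = np.stack([sentence_matrix(s, kv)
              for sents in data_raw.values() for s in sents])
# X.shape == (num_sentences, 10, 300)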

By following some tutorials I transformed this into a TensorDataset.
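For reference, wrapping such an array (together with integer labels for the intents) into a TensorDataset can look like this; X is the padded array sketched above and the label encoding is only illustrative:

import torch
from torch.utils.data import TensorDataset, DataLoader

# one integer id per intent, repeated for each sentence of that intent
intent2id = {intent: i for i, intent in enumerate(data_raw)}
y = torch.tensor([intent2id[intent]
                  for intent, sents in data_raw.items() for _ in sents])

dataset = TensorDataset(torch.from_numpy(X), y)
loader = DataLoader(dataset, batch_size=2, shuffle=True)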

At this moment, I am very confused about how to use word2vec, and I have probably just been wasting time. As of now, I believe the embedding layer of an LSTM configuration should be built by importing the word2vec model weights using:

import gensim
import torch
import torch.nn as nn

# the GoogleNews .bin file needs binary=True to load
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)

This is not enough, as pytorch says it does not accept embedding indices that are not of INT type.

EDIT: I found out that importing the weight matrix from gensim word2vec is not straightforward; one has to import the word_index table as well.
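A rough sketch of what that word-index lookup might look like (assuming gensim >= 4.0, where the table is exposed as key_to_index; the file path and example sentence are placeholders):

import gensim
import torch
import torch.nn as nn

kv = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)

# word -> row index of the weight matrix (gensim >= 4.0; older versions
# keep it under kv.vocab[word].index instead of key_to_index)
word2idx = kv.key_to_index

embedding = nn.Embedding.from_pretrained(torch.FloatTensor(kv.vectors))

# the Embedding layer expects integer indices (LongTensor), not 300-d float vectors
tokens = "book a table for me".split()
idxs = torch.LongTensor([word2idx[w] for w in tokens if w in word2idx])
vectors = embedding(idxs)   # shape: (number_of_known_tokens, 300)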

As soon as I fix this issue I'll post it here.

You need neither a neural network nor word embeddings. Use parsed trees with NLTK, where intents are verbs (V) acting on entities (N) in a given utterance:

[image: parse tree of an example phrase]
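A minimal POS-tagging sketch of that idea (not a full parse tree; the NLTK resource names and the example sentence are only illustrative and may vary between NLTK versions):

import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

sentence = "i want to make a reservation"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))
# treat verbs as candidate intents and nouns as candidate entities
verbs = [w for w, t in tags if t.startswith('VB')]
nouns = [w for w, t in tags if t.startswith('NN')]
print(verbs, nouns)   # e.g. ['want', 'make'] ['reservation']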

To classify a sentence, you can then use a Neural Net. I personally like BERT in fast.ai. Once again, you won't need embeddings to run the classification, and you can do it in multiple languages:

Fast.ai_BERT_ULMFit

Also, if you are working on a chatbot, you can use Named Entity Recognition to guide conversations.
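For example, a quick NER pass with spaCy (already used for preprocessing in the question) might look like this; the model name and sentence are only illustrative:

import spacy

# assumes the small English model is installed:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("book a table for two at 8 pm on friday")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "two" CARDINAL, "8 pm" TIME, "friday" DATE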

If you have enough training data, you may not need fancy neural networks (or even explicit word-vectorization). Just try basic text-classification algorithms (for example from scikit-learn) against basic text representations (such as a simple bag-of-words or bag-of-character n-grams).
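A rough sketch of such a baseline (reusing the data_raw dict from the question; the choice of vectorizer and classifier here is just one reasonable starting point):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# flatten the data_raw dict into (sentence, intent) pairs
texts, labels = [], []
for intent, sentences in data_raw.items():
    for s in sentences:
        texts.append(s)
        labels.append(intent)

# bag-of-character n-grams is fairly robust to typos in short queries
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["can i book a table"]))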

If those don't work, or fail when confronted with novel words, then you might try fancier text-vectorization options. For example, you might replace unknown words with the nearest known word from a large word2vec model. Or represent queries as averages of word-vectors, which is likely a better choice than creating giant fixed-length concatenations with zero-padding. Or use other algorithms for modeling the text, like 'Paragraph Vector' (Doc2Vec in gensim) or deeper neural-network modeling (which requires lots of data & training time).
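A minimal sketch of the averages-of-word-vectors idea, assuming the same GoogleNews vectors are loaded (the file path is a placeholder):

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('path/to/file', binary=True)

def sentence_vector(sentence, kv):
    # average the vectors of the in-vocabulary words; all-zeros if none are known
    vecs = [kv[w] for w in sentence.split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

query = sentence_vector("can i pay with cash", kv)   # shape: (300,)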

(If you have or can acquire lots of domain-specific training data, training word-vectors on that text will likely give you more appropriate word-vectors than reusing those from GoogleNews. Those vectors were trained on professional news stories from a corpus circa 2013, which will have a very different set of word spellings and prominent word senses than what seems to be your main interest: user-typed queries.)
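A tiny sketch of training domain-specific vectors with gensim's Word2Vec; the toy corpus is only illustrative, and real training would need far more sentences:

from gensim.models import Word2Vec

# toy domain-specific corpus: one tokenized query per list entry
corpus = [["book", "a", "table", "for", "me"],
          ["what's", "the", "daily", "menu"]]

# gensim >= 4.0 calls the dimensionality `vector_size`; older releases call it `size`
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=50)
vec = model.wv["table"]   # 100-dimensional vector for a domain word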
