Ignore out-of-vocabulary words when averaging vectors in Spacy

I would like to use a pre-trained word2vec model in Spacy to encode titles by (1) mapping words to their vector embeddings and (2) taking the mean of those word embeddings.

To do this I use the following code:

import spacy
nlp = spacy.load('myspacy.bioword2vec.model')
sentence = "I love Stack Overflow butitsalsodistractive"
avg_vector = nlp(sentence).vector

Here nlp(sentence).vector (1) tokenizes my sentence on whitespace, (2) vectorizes each word according to the dictionary provided, and (3) averages the word vectors within the sentence to produce a single output vector. That's fast and cool.
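
For instance, the document vector should simply be the plain mean over all token vectors (a quick check of this, reusing the model and sentence from above):

import numpy as np
doc = nlp(sentence)
# Doc.vector should match a manual average over every token in the Doc
manual_mean = np.asarray([t.vector for t in doc]).mean(axis=0)
print(np.allclose(doc.vector, manual_mean))  # expected: True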

However, in this process, out-of-vocabulary (OOV) terms are mapped to n-dimensional zero vectors, which affects the resulting mean. Instead, I would like OOV terms to be ignored when the average is taken. In my example, 'butitsalsodistractive' is the only term not present in my dictionary, so I would like nlp("I love Stack Overflow butitsalsodistractive").vector = nlp("I love Stack Overflow").vector.

I have been able to do this with a post-processing step (see the code below), but this becomes too slow for my purposes, so I was wondering if there is a way to tell the nlp pipeline to ignore OOV terms beforehand, so that calling nlp(sentence).vector does not include OOV-term vectors when computing the mean:

import numpy as np
# Post-processing: average only the tokens that actually have a vector (i.e. skip OOV terms)
avg_vector = np.asarray([word.vector for word in nlp(sentence) if word.has_vector]).mean(axis=0)

Approaches tried

In both cases documents is a list of 200 string elements with roughly 400 words each.

  1. Without dealing with OOV terms:
import spacy
import time
nlp = spacy.load('myspacy.bioword2vec.model')
times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [document.vector for document in list(nlp.pipe(documents))]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.0850741124153136 s
  2. Ignoring OOV terms in the output vector. Note that in this case we need to add an extra 'if' statement for those cases in which all words are OOV (if this happens the output vector is r_vec):
r_vec = np.random.rand(200) # Random vector for empty text
# Define function to obtain average vector given a document
def get_vector(text):
    vectors = np.asarray([word.vector for word in nlp(text) if word.has_vector])
    if vectors.size == 0:
        # Case in which none of the words in text were in vocabulary
        avg_vector = r_vec
    else:
        avg_vector = vectors.mean(axis=0)
    return avg_vector

times = []
for i in range(0, 100):
    init = time.time()
    documents_vec = [get_vector(document) for document in documents]
    fin = time.time()
    times.append(fin-init)
print("Mean time after 100 rounds:", sum(times)/len(times), "s")
# Mean time after 100 rounds: 2.4214172649383543 s

In this example the mean difference in time for vectorizing the 200 documents was 0.34 s. However, when processing 200M documents this becomes critical. I am aware that the second approach needs an extra 'if' condition to deal with documents made up entirely of OOV terms, which might slightly increase the computational time. In addition, in the first case I am able to use nlp.pipe(documents) to process all documents in one go, which I guess must optimize the process.

I could always look for extra computational resources to run the second piece of code, but I was wondering if there is any way of using nlp.pipe(documents) while ignoring the OOV terms in the output. Any suggestion will be very much welcome.
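
For clarity, this is roughly what combining the two would look like on my side (a sketch reusing r_vec and documents from above; the OOV filtering still happens in Python, which is the part I would like to push into the pipeline):

# Stream documents through nlp.pipe, but filter OOV tokens per Doc in Python
def get_vector_from_doc(doc):
    vectors = np.asarray([word.vector for word in doc if word.has_vector])
    return r_vec if vectors.size == 0 else vectors.mean(axis=0)

documents_vec = [get_vector_from_doc(doc) for doc in nlp.pipe(documents)]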

See this post by the author of Spacy, which says:

The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want.

Try this for example:

import spacy
nlp = spacy.load('en_core_web_md')
import numpy as np

sentence = "I love Stack Overflow butitsalsodistractive"

print(sentence)
tokens = nlp(sentence)
print([t.text for t in tokens])
cleanText = " ".join([token.text for token in tokens if token.has_vector])
print(cleanText)
tokensClean = nlp(cleanText)
print([t.text for t in tokensClean])


np.array_equal(tokens.vector, tokensClean.vector)
#False
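
Following the quoted suggestion more literally, one could also build a new Doc directly from the in-vocabulary tokens instead of re-running the pipeline on the cleaned string (a sketch, assuming the word vectors live in nlp.vocab so they can be looked up without running the pipeline again):

from spacy.tokens import Doc

# Construct a Doc from only the tokens that have a vector; its .vector is then
# the average of those vectors, so OOV tokens never enter the mean
docClean = Doc(nlp.vocab, words=[t.text for t in tokens if t.has_vector])
print(np.allclose(docClean.vector, tokensClean.vector))  # expected: True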

If you want to speed things up, disable the pipeline components in spacy that you don't use (such as NER, dependency parsing, etc.).
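
For example, a sketch of both options (the component names here are assumptions; use whatever your pipeline actually contains):

# Disable unused components at load time when you only need vectors
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser', 'ner'])

# Or disable them per call when streaming many documents
docs = list(nlp.pipe(documents, disable=['tagger', 'parser', 'ner']))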
