
How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g.:

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [d.lower().split() for d in documents]

# Note: this question uses the pre-4.0 gensim API (`size` was renamed
# `vector_size` in gensim 4.x).
w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove the word from the w2v vocabulary, e.g.:

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we query for similar words after deleting graph, we see the word graph popping up, e.g.:

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
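
(A quick check of why this happens, assuming the pre-4.0 gensim attributes used above: del only removed the entry from the wv.vocab dict, while the vector rows and the index-to-word list are still in place, so similarity look-ups can still reach 'graph'.)

assert 'graph' not in w2v_model.wv.vocab   # the dict entry is gone...
assert 'graph' in w2v_model.wv.index2word  # ...but the word is still indexed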

How to remove a word completely from a Word2Vec model in gensim?


Updated

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

  • Let's say my universe of words is all the words in the corpus, so that the model learns the dense relations between all of them.

  • But when I want to generate the similar words, they should only come from a subset of domain-specific words.

  • It's possible to generate more than enough candidates from .most_similar() and then filter them, but if the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar, which is inefficient (see the sketch after this list).

  • It would be better if the words were totally removed from the word vectors; then .most_similar() would never return words outside of the specific domain.
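
A minimal sketch of that over-generate-then-filter workaround (assuming the pre-4.0 gensim API used in this question; most_similar_in_domain and domain_words are hypothetical names):

domain_words = {"graph", "trees", "binary"}  # hypothetical domain vocabulary

def most_similar_in_domain(model, word, domain_words, topn=3):
    # Over-generate: rank the whole vocabulary, then keep only in-domain words.
    candidates = model.most_similar(word, topn=len(model.wv.vocab))
    return [(w, s) for w, s in candidates if w in domain_words][:topn]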

I wrote a function which removes words from KeyedVectors which aren't in a predefined word list.

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    w2v.init_sims()  # make sure vectors_norm is populated before we read it

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)  # re-index into the new arrays
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)  # keep these as numpy arrays, not lists
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)

It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
 ('lager', 0.7733745574951172),
 ('Beer', 0.71753990650177),
 ('drinks', 0.668931245803833),
 ('lagers', 0.6570086479187012),
 ('Yuengling_Lager', 0.655455470085144),
 ('microbrew', 0.6534324884414673),
 ('Brooklyn_Lager', 0.6501551866531372),
 ('suds', 0.6497018337249756),
 ('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
 ('wine', 0.6217695474624634),
 ('bash', 0.20583480596542358),
 ('computer', 0.06677375733852386),
 ('python', 0.005948573350906372)]

There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.

The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with vectors corresponding to the words of your interest. Then you are done.

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

Update:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

If you look at this line, it means that if restrict_vocab is used, it restricts to the top n words in the vocab, which is meaningful only if you have sorted the vocab by frequency. If you are not passing restrict_vocab, self.vectors_norm is what goes into limited.
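
For example, a freshly trained gensim vocab is frequency-sorted by default (sorted_vocab=1), so restricting the search to the 10,000 most frequent words is just:

w2v.most_similar("beer", restrict_vocab=10000)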

The method most_similar calls another method, init_sims. This initializes the value of self.vectors_norm, as shown below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So, you can pick out the words that you are interested in, prepare their norms, and use them in place of limited. This should work.
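
A minimal sketch of that idea (assuming the pre-4.0 gensim API; most_similar_restricted and allowed_words are hypothetical names, with the norms prepared exactly as init_sims does above):

import numpy as np

def most_similar_restricted(wv, word, allowed_words, topn=5):
    wv.init_sims()  # populate wv.vectors_norm
    words = [w for w in allowed_words if w in wv.vocab and w != word]
    idxs = [wv.vocab[w].index for w in words]
    limited = wv.vectors_norm[idxs]               # plays the role of `limited`
    mean = wv.vectors_norm[wv.vocab[word].index]  # unit vector of the query word
    dists = np.dot(limited, mean)
    best = np.argsort(-dists)[:topn]
    return [(words[i], float(dists[i])) for i in best]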

Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

Suppose you only want to keep the top 5000 words in your model.

import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In op's case
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

for w in words_to_trim:
    del wv.vocab[w]

wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])

# Re-sync the surviving Vocab.index values with the compacted arrays
# (needed when the trimmed words are not all at the tail of the vocab,
# e.g. in op's case).
for i, w in enumerate(wv.index2word):
    wv.vocab[w].index = i

This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.

The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
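
For instance (a short sketch; 'w2v_trimmed.bin' is a hypothetical filename):

from gensim.models import KeyedVectors

wv.save_word2vec_format('w2v_trimmed.bin', binary=True)   # much smaller file
wv_small = KeyedVectors.load_word2vec_format('w2v_trimmed.bin', binary=True)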

I have tried this and feel that the most straightforward way is as follows:

  1. Get the Word2Vec embeddings in text file format.
  2. Identify the lines corresponding to the word vectors that you would like to keep.
  3. Write a new text file Word2Vec embedding model.
  4. Load the model and enjoy (save to binary if you wish, etc.)...

My sample code is as follows:

import re

# Note: isLatin(), txtWrite() and txtAppend() below are the author's own
# helper functions (a Latin-character test and simple file write/append);
# file_entVecs_txt and file_entVecs_SHORT_txt are the input/output paths.
line_no = 0 # line0 = header
numEntities=0
targetLines = []

with open(file_entVecs_txt,'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care about entity that is Latin-only
                isLatinFlag = False
                break
            if char==' ': # reached separator
                ent = line[:i_l]
                break

        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search(r'\d',ent):
            continue

        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match(r'^ENTITY/.*#',ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub(r'^\d+',str(numEntities),header,count=1)

# Generate the file
txtWrite('',file_entVecs_SHORT_txt)
txtAppend(header_new,file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt,'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no+=1

        line = fp.readline()
        line_no+=1
        ptr+=1
        txtAppend(line,file_entVecs_SHORT_txt)
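
Step 4 is then just loading the filtered file back (a short sketch; 'entVecs_SHORT.bin' is a hypothetical output name):

from gensim.models import KeyedVectors

wv_short = KeyedVectors.load_word2vec_format(file_entVecs_SHORT_txt, binary=False)
wv_short.save_word2vec_format('entVecs_SHORT.bin', binary=True)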

FYI. FAILED ATTEMPT: I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung) and left it to run overnight for at least 12 hrs; it was still stuck at getting new words from the restricted set... This is perhaps because I have a lot of entities... But my text-file method works within an hour.

FAILED CODE

# [FAILED] Stuck at Building new vocab...
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    print('Building new vocab..')

    for i in range(len(w2v.vocab)):

        if (i%int(1e6)==0) and (i!=0):
            print(f'working on {i}')

        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)
