
Find most similar words for OOV word

I am looking for the most similar words for an out-of-vocabulary (OOV) word using gensim. Something like this:

    def get_word_vec(self, model, word):
        try:
            if word not in model.wv.vocab:
                mostSimWord = model.wv.similar_by_word(word)
                print(mostSimWord)
            else:
                print(word)
        except Exception as ex:
            print(ex)

Is there a way to achieve this task? Options other than gensim are also welcome.

If you train a FastText model instead of a Word2Vec model, it inherently learns vectors for word-fragments (of configurable size ranges) in addition to full words.

In languages like English & many others (but not all), unknown words are often typos, alternate forms, or related by roots and suffixes to known words. Thus, having vectors for subwords, then using those to tally up a good guess vector for an unknown word, can work well enough to be worth trying – better than ignoring such words, or using a totally random or origin-point vector.
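For example, with gensim's FastText (a minimal sketch, assuming gensim 4.x, where the vocabulary is exposed as `wv.key_to_index`; the toy corpus below just stands in for real training data):

    from gensim.models import FastText

    # Toy corpus; train on your real data in practice.
    sentences = [
        ["the", "quick", "brown", "fox", "jumps"],
        ["the", "lazy", "dog", "sleeps"],
        ["a", "quick", "red", "fox", "runs"],
    ]

    # min_n / max_n set the character n-gram sizes that provide subword vectors.
    model = FastText(sentences, vector_size=50, window=3, min_count=1,
                     min_n=3, max_n=6, epochs=20)

    oov = "quickest"  # not a full word in the training vocabulary
    print(oov in model.wv.key_to_index)        # False: no full-word vector stored
    print(model.wv[oov][:5])                   # vector synthesized from its n-grams
    print(model.wv.most_similar(oov, topn=3))  # nearest known words to that vector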

There's no built-in way to try to extract such relationships from an existing set of word-vectors that isn't FastText/subword-based – but it'd be theoretically possible. You could compute edit distances to, or counts-of-shared-subwords with, all known words, & create a guess-vector by weighted combination of the N-closest words. (This might work really well with typos & rarer alternate spellings, but not as much for truly-absent novel words.)
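A rough sketch of that idea (not a built-in gensim feature; `guess_vector` is a hypothetical helper, and difflib's string-similarity ratio stands in for a proper edit distance):

    import difflib
    import numpy as np

    def guess_vector(kv, word, topn=5):
        """Guess a vector for an OOV word from spelling-similar known words.

        kv is a gensim KeyedVectors (e.g. model.wv). Finds the topn known words
        closest to `word` by difflib ratio and averages their vectors, weighted
        by that ratio.
        """
        vocab = list(kv.key_to_index)  # gensim 4.x vocabulary listing
        candidates = difflib.get_close_matches(word, vocab, n=topn, cutoff=0.0)
        if not candidates:
            return None
        weights = np.array([difflib.SequenceMatcher(None, word, c).ratio()
                            for c in candidates])
        vectors = np.array([kv[c] for c in candidates])
        return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

    # Usage: look up the nearest known words to the guessed vector
    # vec = guess_vector(model.wv, "recieve")
    # if vec is not None:
    #     print(model.wv.similar_by_vector(vec, topn=5))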
