
Gensim's `model.wv.most_similar` returns phonologically similar words

Gensim's `wv.most_similar` returns phonologically close words (similar sounds) instead of semantically similar ones. Is this normal? Why might this happen?

Here's the documentation on `most_similar`: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar

In [144]: len(vectors.vocab)
Out[144]: 32966

... 

In [140]: vectors.most_similar('fight')
Out[140]:
[('Night', 0.9940935373306274),
 ('knight', 0.9928507804870605),
 ('fright', 0.9925899505615234),
 ('light', 0.9919329285621643),
 ('bright', 0.9914385080337524),
 ('plight', 0.9912853240966797),
 ('Eight', 0.9912533760070801),
 ('sight', 0.9908033013343811),
 ('playwright', 0.9905624985694885),
 ('slight', 0.990411102771759)]

In [141]: vectors.most_similar('care')
Out[141]:
[('spare', 0.9710584878921509),
 ('scare', 0.9626247882843018),
 ('share', 0.9594929218292236),
 ('prepare', 0.9584596157073975),
 ('aware', 0.9551078081130981),
 ('negare', 0.9550014138221741),
 ('glassware', 0.9507938027381897),
 ('Welfare', 0.9489598274230957),
 ('warfare', 0.9487678408622742),
 ('square', 0.9473209381103516)]

The training data contains academic papers and this was my training script:

from gensim.models.fasttext import FastText as FT_gensim
import gensim.models.keyedvectors as word2vec

dim_size = 300
epochs = 10

# corpus_reader: an iterable of tokenized sentences; total_examples: the sentence count (both defined elsewhere)
model = FT_gensim(size=dim_size, window=3, min_count=1)
model.build_vocab(sentences=corpus_reader, progress_per=1000)
model.train(sentences=corpus_reader, total_examples=total_examples, epochs=epochs)

# saving vectors to disk
path = "/home/ubuntu/volume/my_vectors.vectors"
model.wv.save_word2vec_format(path, binary=True)

# loading vectors (binary=True must match the format used when saving)
vectors = word2vec.KeyedVectors.load_word2vec_format(path, binary=True)

You've chosen to use the FastText algorithm to train your vectors. That algorithm specifically makes use of subword fragments (like 'ight' or 'are') to have a chance of synthesizing good guess-vectors for 'out-of-vocabulary' words that weren't in the training set, and that could be one contributor to the results you're seeing.

However, usually words' unique meanings predominate, with the influence of such subwords only coming into play for unknown words. And it's rare for the most-similar lists of any words in a healthy set of word-vectors to have so many 0.99+ similarities.
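To make the subword point concrete, here is a minimal sketch (assuming a trained gensim 3.x FastText model named model, and a made-up word that was not in the training corpus) of how FastText assembles a guess-vector from character n-grams, which a plain Word2Vec model could not do:

oov_word = 'brightish'  # hypothetical word, assumed absent from the training corpus

print(oov_word in model.wv.vocab)            # False: not a known full word
vec = model.wv[oov_word]                     # still works: assembled from character n-gram vectors
print(model.wv.most_similar(oov_word)[:3])   # neighbors driven largely by shared fragments like 'ight'

# A plain Word2Vec model has no subword vectors, so the same lookup would raise a KeyError.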

So, I suspect there's something weird or deficient in your training data.

What kind of text is it, and how many total words of example usages does it contain?

Were there any perplexing aspects of training progress/speed shown in INFO-level logs during training?
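(For reference, a minimal sketch of enabling those INFO-level logs with Python's standard logging module, run before build_vocab / train:)

import logging

# show gensim's vocab-scan and training-progress messages
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)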

(300 dimensions may also be a bit excessive with a vocabulary of only 33K unique words; that's a vector-size that's common in work with hundreds of thousands to millions of unique words, and plentiful training data.)

That's a good call-out on the dimension size. Reducing that param definitely did make a difference.

1. Reproducing the original behavior (where dim_size=300) with a larger corpus (33k --> 275k unique vocab):

(Note: I've also tweaked a few other params, like min_count, window, etc.)

from gensim.models.fasttext import FastText as FT_gensim

# window is the maximum distance between the current and predicted word within a sentence
fmodel0 = FT_gensim(size=300, window=5, min_count=3, workers=10)
fmodel0.build_vocab(sentences=corpus)
fmodel0.train(sentences=corpus, total_examples=fmodel0.corpus_count, epochs=5)

fmodel0.wv.vocab['cancer'].count  # number of times the word occurred in the corpus
fmodel0.wv.most_similar('cancer')
fmodel0.wv.most_similar('care')
fmodel0.wv.most_similar('fight')

# -----------
# cancer 
[('breastcancer', 0.9182084798812866),
 ('noncancer', 0.9133851528167725),
 ('skincancer', 0.898530900478363),
 ('cancerous', 0.892244279384613),
 ('cancers', 0.8634265065193176),
 ('anticancer', 0.8527657985687256),
 ('Cancer', 0.8359113931655884),
 ('lancer', 0.8296531438827515),
 ('Anticancer', 0.826178252696991),
 ('precancerous', 0.8116365671157837)]

# care
[('_care', 0.9151567816734314),
 ('încălcare', 0.874087929725647),
 ('Nexcare', 0.8578598499298096),
 ('diacare', 0.8515325784683228),
 ('încercare', 0.8445525765419006),
 ('fiecare', 0.8335763812065125),
 ('Mulcare', 0.8296753168106079),
 ('Fiecare', 0.8292017579078674),
 ('homecare', 0.8251558542251587),
 ('carece', 0.8141698837280273)]

# fight
[('Ifight', 0.892048180103302),
 ('fistfight', 0.8553390502929688),
 ('dogfight', 0.8371964693069458),
 ('fighter', 0.8167843818664551),
 ('bullfight', 0.8025394678115845),
 ('gunfight', 0.7972971200942993),
 ('fights', 0.790093183517456),
 ('Gunfight', 0.7893823385238647),
 ('fighting', 0.775499701499939),
 ('Fistfight', 0.770946741104126)]

2. Reducing the dimension size to 5:

_fmodel = FT_gensim(size=5, window=5, min_count=3, workers=10)
_fmodel.build_vocab(sentences=corpus)
_fmodel.train(sentences=corpus, total_examples=_fmodel.corpus_count, epochs=5)  # workers is specified in the constructor


_fmodel.wv.vocab['cancer'].count  # number of times the word occurred in the corpus
_fmodel.wv.most_similar('cancer')
_fmodel.wv.most_similar('care')
_fmodel.wv.most_similar('fight')

# cancer 
[('nutrient', 0.999614417552948),
 ('reuptake', 0.9987781047821045),
 ('organ', 0.9987629652023315),
 ('tracheal', 0.9985960721969604),
 ('digestion', 0.9984923601150513),
 ('cortes', 0.9977986812591553),
 ('liposomes', 0.9977765679359436),
 ('adder', 0.997713565826416),
 ('adrenals', 0.9977011680603027),
 ('digestive', 0.9976763129234314)]

# care
[('lappropriate', 0.9990135431289673),
 ('coping', 0.9984776973724365),
 ('promovem', 0.9983049035072327),
 ('requièrent', 0.9982239603996277),
 ('diverso', 0.9977829456329346),
 ('feebleness', 0.9977156519889832),
 ('pathetical', 0.9975940585136414),
 ('procure', 0.997504472732544),
 ('delinking', 0.9973599910736084),
 ('entonces', 0.99733966588974)]

# fight 
[('decied', 0.9996457099914551),
 ('uprightly', 0.999250054359436),
 ('chillies', 0.9990670680999756),
 ('stuttered', 0.998710036277771),
 ('cries', 0.9985755681991577),
 ('famish', 0.998246431350708),
 ('immortalizes', 0.9981046915054321),
 ('misled', 0.9980905055999756),
 ('whore', 0.9980045557022095),
 ('chanted', 0.9978444576263428)]

It's not GREAT, but it's no longer returning words that merely contain the subwords.

3. And for good measure, benchmark against Word2Vec:

from gensim.models.word2vec import Word2Vec

wmodel300 = Word2Vec(corpus, size=300, window=5, min_count=2, workers=10)
wmodel300.total_train_time  # 187.1828162111342
wmodel300.wv.most_similar('cancer')

[('cancers', 0.6576876640319824),
 ('melanoma', 0.6564366817474365),
 ('malignancy', 0.6342018842697144),
 ('leukemia', 0.6293295621871948),
 ('disease', 0.6270142197608948),
 ('adenocarcinoma', 0.6181445121765137),
 ('Cancer', 0.6010828614234924),
 ('tumors', 0.5926551222801208),
 ('carcinoma', 0.5917977094650269),
 ('malignant', 0.5778893828392029)]

^ Better captures distributional similarity + much more realistic similarity measures.

But with a smaller dim_size, the result is somewhat worse (also, the similarities are less realistic, all around 0.99):

wmodel5 = Word2Vec(corpus, size=5, window=5, min_count=2, workers=10)
wmodel5.total_train_time  # 151.4945764541626
wmodel5.wv.most_similar('cancer')

[('insulin', 0.9990534782409668),
 ('reaction', 0.9970406889915466),
 ('embryos', 0.9970351457595825),
 ('antibiotics', 0.9967449903488159),
 ('supplements', 0.9962579011917114),
 ('synthesize', 0.996055543422699),
 ('allergies', 0.9959680438041687),
 ('gadgets', 0.9957243204116821),
 ('mild', 0.9953152537345886),
 ('asthma', 0.994774580001831)]

Therefore, increasing the dimension size seems to help Word2Vec, but not fastText...

I'm sure this contrast has to do with the fact that the FastText model is learning subword info, and that somehow interacts with this parameter in a way that makes increasing its value hurtful. But I'm not sure how exactly... I'm trying to reconcile this finding with the intuition that increasing the size of the vectors should help in general, because larger vectors capture more information.

I had the same issue with a corpus of 366k words. I think the problem is with the min_n and max_n parameters. Try using

word_ngrams = 0

According to the documentation, this is equivalent to Word2Vec. Or try setting min_n and max_n to larger values.
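For illustration, a minimal sketch of both suggestions (parameter names are from the gensim 3.x FastText API; corpus is assumed to be an iterable of tokenized sentences):

from gensim.models.fasttext import FastText as FT_gensim

# Option 1: disable subword n-grams entirely; per the documentation this is equivalent to Word2Vec
model_no_subwords = FT_gensim(sentences=corpus, size=100, window=5, min_count=3, word_ngrams=0, workers=10)

# Option 2: keep subwords, but require longer character n-grams than the defaults (min_n=3, max_n=6),
# so short fragments like 'ight' or 'are' carry less weight in the similarities
model_long_ngrams = FT_gensim(sentences=corpus, size=100, window=5, min_count=3, min_n=5, max_n=8, workers=10)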
