简体   繁体   中英

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages ie English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:

model.wv.most_similar(positive = ['man'])
Out[14]: 
[('woman', 0.7380284070968628),
 ('lady', 0.6933152675628662),
 ('monk', 0.6662989258766174),
 ('guy', 0.6513140201568604),
 ('soldier', 0.6491742134094238),
 ('priest', 0.6440571546554565),
 ('farmer', 0.6366012692451477),
 ('sailor', 0.6297377943992615),
 ('knight', 0.6290514469146729),
 ('person', 0.6288090944290161)]
--------------------------------------------

Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,

model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will 
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215

This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.

This is what I am doing here:

#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList

#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")

First of all, you should really use self.wv.similarity().

I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.

Eg. Until your model knows that

Man:Aadmi::Woman:Aurat,

from the word embeddings, it can never make out the

Raja:King::Rani:Queen

relation. And for that, you need some anchor between the two corpuses. Here are a few suggestions that you can try out:

  1. Make an independent Hindi corpus/model
  2. Maintain and lookup data of a few English->Hindi word pairs that you have will have to create manually.
  3. Randomly replace input document words with their counterparts from the corresponding document while training

These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.

I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. This means you can derive a translation matrix to convert word embeddings from one language model into the vector space of another language model. What does all of that mean? It means I can take a word from one language, and find words in the other language that have a similar meaning.

I've written a small Python package that implements this for you: transvec . Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:

import gensim.downloader
from transvec.transformers import TranslationWordVectorizer

# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")

# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]

bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)

# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]

Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.

So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).

After reading the comments, I think that the problem is in the very different grammatical construction between English and Hindi sentences. I have worked with Hindi NLP models and it is much more difficult to get similar results as English (since you mention it).

In Hindi there's no order between words at all, only when declining them. Moreover, the translation of a sentence between languages that are not even descendants of the same root language is somewhat random and you can not assume that the contexts of both sentences are similar.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM