
Gensim: Loss of Words/Tokens while Training

I have a corpus built from Wikimedia dump files, stored in sentences.txt. I have a sentence, say 'नीरजः हाँ माता जी! स्कूल ख़त्म होते सीधा घर आऊँगा'.

Now when I try to extract the word vectors, there are always one or two words that were missed during training (despite being included in the list to be trained on), and I get a KeyError. Is there any way to improve the training so that it doesn't miss words so frequently?

Here is proof that it does happen. tok.wordtokenize is a word tokenizer. sent.drawlist() and sents.drawlist() each return a list of sentences from the corpus stored in sentences.txt.


>>> sentence = 'नीरजः हाँ माता जी! स्कूल ख़त्म होते सीधा घर आऊँगा'
>>> sentence = tok.wordtokenize(sentence) #tok.wordtokenize() is simply a word tokenizer.
>>> sentences = sent.drawlist()
>>> sentences = [tok.wordtokenize(i) for i in sentences]
>>> sentences2 = sents.drawlist()
>>> sentences2 = [tok.wordtokenize(i) for i in sentences2]
>>> sentences = sentences2 + sentences + sentence
>>> "नीरजः" in sentences #proof that the word is present inside sentences
True
>>> sentences[0:10] #list of tokenized sentences.
[['विश्व', 'भर', 'में', 'करोड़ों', 'टीवी', 'दर्शकों', 'की', 'उत्सुकता', 'भरी', 'निगाह', 'के', 'बीच', 'मिस', 'ऑस्ट्रेलिया', 'जेनिफर', 'हॉकिंस', 'को', 'मिस', 'यूनिवर्स-२००४', 'का', 'ताज', 'पहनाया', 'गया'], ['करीब', 'दो', 'घंटे', 'चले', 'कार्यक्रम', 'में', 'विभिन्न', 'देशों', 'की', '८०', 'सुंदरियों', 'के', 'बीच', '२०', 'वर्षीय', 'हॉकिंस', 'को', 'सर्वश्रेष्ठ', 'आंका', 'गया'], ['मिस', 'अमेरिका', 'शैंडी', 'फिनेजी', 'को', 'प्रथम', 'उप', 'विजेता', 'और', 'मिस', 'प्यूरेटो', 'रिको', 'अल्बा', 'रेइज', 'द्वितीय', 'उप', 'विजेता', 'चुनी', 'गई'], ['भारत', 'की', 'तनुश्री', 'दत्ता', 'अंतिम', '१०', 'प्रतिभागियों', 'में', 'ही', 'स्थान', 'बना', 'पाई'], ['हॉकिंस', 'ने', 'कहा', 'कि', 'जीत', 'के', 'बारे', 'में', 'उसने', 'सपने', 'में', 'भी', 'नहीं', 'सोचा', 'था'], ['सौंदर्य', 'की', 'यह', 'शीर्ष', 'प्रतियोगिता', 'क्विटो', 'के', 'कन्वेंशन', 'सेंटर', 'में', 'मंगलवार', 'देर', 'रात', 'शुरू', 'हुई'], ['करीब', '७५००', 'विशिष्ट', 'दर्शकों', 'की', 'मौजूदगी', 'में', 'विश्व', 'की', 'सर्वश्रेष्ठ', 'सुंदरी', 'के', 'चयन', 'की', 'कवायद', 'शुरू', 'हुई'], ['हर', 'चरण', 'के', 'बाद', 'लोगों', 'की', 'सांसे', 'थमने', 'लगतीं'], ['टीवी', 'पर', 'लुत्फ', 'उठा', 'रहे', 'दर्शक', 'अपने', 'देश', 'व', 'क्षेत्र', 'की', 'सुंदरी', 'की', 'प्रतियोगिता', 'में', 'स्थिति', 'के', 'बारे', 'में', 'व्यग्र', 'रहे'], ['फाइनल', 'में', 'पहुंचने', 'वाली', 'पांच', 'प्रतिभागियों', 'में', 'मिस', 'पेराग्वे', 'यानिना', 'गोंजालेज', 'और', 'मिस', 'त्रिनिदाद', 'व', 'टोबैगो', 'डेनियल', 'जोंस', 'भी', 'शामिल', 'थीं']]
>>> model = gensim.models.Word2Vec(sentences, size =10,  min_count=1) 
>>> pred = []
>>> for word in sentence:
...         pred.append(model.wv[word].tolist())
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/djokester/anaconda3/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 574, in __getitem__
    return self.word_vec(words)
  File "/home/djokester/anaconda3/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 273, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'नीरजः' not in vocabulary"

As you can see, I check for the word "नीरजः" inside the list of tokenized sentences. It is present in the list that I feed into the Word2Vec trainer and yet after training it is not in the vocabulary.

Word2Vec should never 'miss' words that were included in the tokenized corpus and had at least min_count occurrences. So if you get a KeyError, you can be confident that the associated word-token was never actually supplied during training.
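
As a quick sanity check, here is a minimal sketch with a toy corpus (assuming the same pre-4.0 gensim API as in your snippet, i.e. the size parameter and wv.vocab), showing that every token supplied as part of a proper list of token-lists with min_count=1 ends up in the vocabulary:

import gensim

# A tiny illustrative corpus: a list of sentences, each a list of word-tokens.
toy_corpus = [
    ['नीरजः', 'हाँ', 'माता', 'जी'],
    ['स्कूल', 'ख़त्म', 'होते', 'सीधा', 'घर', 'आऊँगा'],
]

# With min_count=1, every token seen during training is retained.
model = gensim.models.Word2Vec(toy_corpus, size=10, min_count=1)

print('नीरजः' in model.wv.vocab)  # True - the token was supplied inside a token-list
print(model.wv['नीरजः'])          # no KeyError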

In your example code to reproduce, take a close look at:

sentence = "Jack and Jill went up the Hill"
sentence = [word_tokenize(i) for i in sentence]

i in sentence will be each character of the string. It's unlikely your unshown word_tokenize() function does anything useful with the individual characters ['J', 'a', 'c', 'k', ' ', ...] - it probably just leaves them as a list of letters. Then +-appending that to your other sentences makes sentences 30 items longer, rather than the single extra tokenized example you expect.
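
A quick illustration of that iteration behaviour (word_tokenize here stands in for your unshown tokenizer, so the commented-out lines are an assumption about what was intended):

sentence = "Jack and Jill went up the Hill"

# Iterating over a string yields individual characters, not words:
print([i for i in sentence][:5])   # ['J', 'a', 'c', 'k', ' ']

# So this builds ~30 one-character "sentences" instead of one tokenized sentence:
# sentence = [word_tokenize(i) for i in sentence]

# What was probably intended - tokenize the whole string once,
# then append it as ONE extra sentence:
# sentence = word_tokenize(sentence)
# sentences = sentences + [sentence]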

I suspect your real issue is different but related: something wrong with tokenization and list composition. So check every step individually for the expected results and nested types, as in the sketch below. (Using a unique variable per step, like sentences_tokenized or sentence_tokenized, instead of clobber-reusing variables like sentences and sentence, can help with debugging.)
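
For example, a rough sketch of that step-by-step checking, reusing your tok/sent/sents objects with hypothetical one-purpose variable names:

sentence_raw = 'नीरजः हाँ माता जी! स्कूल ख़त्म होते सीधा घर आऊँगा'
sentence_tokenized = tok.wordtokenize(sentence_raw)

sentences_tokenized = [tok.wordtokenize(s) for s in sent.drawlist()]
sentences2_tokenized = [tok.wordtokenize(s) for s in sents.drawlist()]

# Each step should produce the expected nested type:
print(type(sentence_tokenized), sentence_tokenized[:5])          # one list of str tokens
print(type(sentences_tokenized[0]), sentences_tokenized[0][:5])  # list of lists of str tokens
print(len(sentences_tokenized), len(sentences2_tokenized))       # sentence counts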

Update, per your suggested edit: the issue with your latest code is that the line where you +-append is still wrong; it appends each word of sentence as if it were its own new sentence. Looking at the results of each step - the variable contents and lengths - should make this clear, and again I recommend not reusing variables for multiple steps while debugging. The line "नीरजः" in sentences #proof that the word is present inside sentences is actually proving that sentences is wrong: that single word should not be a top-level item of sentences, but a token inside its single last list-of-tokens, sentences[-1].
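
In other words, the fix is to wrap the single tokenized sentence in its own list before concatenating (a sketch reusing your existing variable names):

# Wrong: extends the corpus by one item per WORD in `sentence`:
#   sentences = sentences2 + sentences + sentence
# Right: appends the tokenized sentence as ONE extra training example:
sentences = sentences2 + sentences + [sentence]

print("नीरजः" in sentences)      # False - the bare word is no longer a top-level item
print("नीरजः" in sentences[-1])  # True  - it is a token inside the last sentence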

