
How to train Word2Vec model on Wikipedia page using gensim?

After reading this article, I started to train my own model. The problem is that the author does not make it clear what the sentences passed to Word2Vec should look like.

I download the text from a Wikipedia page, as described in the article, and make a list of sentences from it:

sentences = [word for word in wikipage.content.split('.')]

So, for example, sentences[0] looks like:

'Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed'

Then I try to train a model with this list:

model = Word2Vec(sentences, min_count=2, size=50, window=10,  workers=4)

But the dictionary of the model consists of letters! For example, the output of model.wv.vocab.keys() is:

dict_keys([',', 'q', 'D', 'B', 'p', 't', 'o', '(', ')', '0', 'V', ':', 'j', 's', 'R', '{', 'g', '-', 'y', 'c', '9', 'I', '}', '1', 'M', ';', '`', '\n', 'i', 'r', 'a', 'm', '–', 'v', 'N', 'h', '/', 'P', 'F', '8', '"', '’', 'W', 'T', 'u', 'U', '?', ' ', 'n', '2', '=', 'w', 'C', 'O', '6', '&', 'd', '4', 'S', 'J', 'E', 'b', 'L', '$', 'l', 'e', 'H', '≈', 'f', 'A', "'", 'x', '\\', 'K', 'G', '3', '%', 'k', 'z'])

What am I doing wrong? Thanks in advance!
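Iterating a Python string yields its characters one by one, so passing a list of raw strings makes gensim treat each character as a "word", which is why the vocabulary above is all single letters. A quick illustration of the cause:

>>> list('Machine')  # iterating a string yields characters, not words
['M', 'a', 'c', 'h', 'i', 'n', 'e']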

The input to the Word2Vec model object should be a list of lists of words, e.g. using the tokenization functions in nltk:

>>> import wikipedia
>>> from nltk import sent_tokenize, word_tokenize
>>> page = wikipedia.page('machine learning')
>>> sentences = [word_tokenize(sent) for sent in sent_tokenize(page.content)]
>>> sentences[0]
['Machine', 'learning', 'is', 'the', 'subfield', 'of', 'computer', 'science', 'that', 'gives', 'computers', 'the', 'ability', 'to', 'learn', 'without', 'being', 'explicitly', 'programmed', '.']

And feed it in:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences, min_count=2, size=50, window=10, workers=4)
>>> list(model.wv.vocab.keys())[:10]
['sparsely', '(', 'methods', 'their', 'typically', 'information', 'assessment', 'False', 'often', 'problems']
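As a quick sanity check of the trained vectors, you can query for nearest neighbours; this is just a usage sketch, and the actual neighbours will vary between training runs and page revisions:

>>> model.wv.most_similar('learning', topn=3)  # results vary per run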

More generally, any iterable of token lists works as input. Note that gensim iterates over the corpus more than once (one pass to build the vocabulary, then the training passes), so a one-shot generator would be exhausted after the first pass; materialize the tokens first, e.g.:

>>> from gensim.utils import tokenize
>>> paragraphs = [list(tokenize(p)) for p in page.content.split('\n')]  # one token list per paragraph
>>> model = Word2Vec(paragraphs, min_count=2, size=50, window=10, workers=4)
>>> list(model.wv.vocab.keys())[:10]
['sparsely', 'methods', 'their', 'typically', 'information', 'assessment', 'False', 'often', 'problems', 'symptoms']
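For a corpus too large to materialize as one list, the usual pattern is a restartable iterable that rebuilds the token stream on every pass, since gensim needs to walk the corpus once for the vocabulary and again for training. A minimal sketch, where WikiSentences is a hypothetical helper name:

>>> class WikiSentences:
...     """Restartable: yields a fresh token stream on each pass."""
...     def __init__(self, text):
...         self.text = text
...     def __iter__(self):
...         for paragraph in self.text.split('\n'):
...             yield list(tokenize(paragraph))
...
>>> model = Word2Vec(WikiSentences(page.content), min_count=2, size=50, window=10, workers=4)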
