Word2Vec in Gensim using model.most_similar

I am new to Word2Vec in Gensim. I want to build a Word2Vec model for a text (extracted from Wikipedia: Machine Learning) and find the words most similar to 'Machine Learning'.

My current code is as follows.

# import modules & set up logging
from gensim.models import Word2Vec

sentences = "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision."
# train word2vec on the sentences
model = Word2Vec(sentences, min_count=1)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, vocab contains only single characters:

['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'r']

Please help me get the most similar words using model.most_similar.

The Word2Vec class expects its sentences corpus to be an iterable of individual items, each of which is a list of word tokens.

You're providing a single string. Iterating over a string yields individual characters. When Word2Vec then tries to interpret each of those characters as a list of tokens, it still gets just a single character, so the only 'words' it sees are single characters.
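The behaviour described above can be reproduced in plain Python, without Gensim. This is only an illustrative sketch; the shortened `text` value and the variable names are made up for the example.

```python
text = "Machine learning is the subfield of computer science"

# Iterating over a string yields one character at a time, so every
# "sentence" Word2Vec receives is a single character.
as_string = list(text)[:5]

# Iterating over a one-character string yields that same character again,
# which is why the "tokens" also end up being single characters.
tokens_of_first_item = list(as_string[0])

# A list of token lists is the shape Word2Vec expects:
# one inner list of words per sentence.
as_token_lists = [text.split()]

print(as_string)
print(tokens_of_first_item)
print(as_token_lists[0][:3])
```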

At the very least, you'd want your corpus to be constructed more like this:

sentences = [
    "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision.".split(),
]

That's still just one 'sentence', but it will be split on whitespace into word tokens.
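Splitting on whitespace alone leaves punctuation and citation markers such as "data[4]" or "that," attached to the tokens. One possible refinement, sketched here with only the standard library, is to strip the markers, split into sentences, and lowercase the words. The `tokenize` helper, its regular expressions, and the short `sample` text are assumptions for illustration, not part of Gensim.

```python
import re


def tokenize(text):
    """Split raw text into sentences, then each sentence into lowercase tokens."""
    # Drop Wikipedia-style citation markers such as [1], [verify] or [5]:2.
    text = re.sub(r"\[[^\]]*\](?::\d+)?", "", text)
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    # Keep runs of letters/digits, allowing inner hyphens or apostrophes.
    token_lists = [
        re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())
        for sentence in raw_sentences
    ]
    # Discard sentences that ended up with no tokens at all.
    return [tokens for tokens in token_lists if tokens]


sample = (
    "Machine learning explores algorithms that can learn from data[4]. "
    "Samuel coined the term machine learning in 1959 while at IBM."
)
sentences = tokenize(sample)
print(sentences)
```

The result is a list of token lists, one per sentence, which is the shape Word2Vec expects. The sentence splitter is deliberately crude and will mis-split abbreviations; a dedicated tokenizer may be a better fit for real corpora.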

Note also that useful word2vec results require large, varied text samples; toy-sized examples won't usually show the kinds of word similarities or relative word arrangements that word2vec is famous for creating.
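With a properly tokenized corpus, the similarity query itself might look like the sketch below. The helper names `query_terms` and `most_similar_to` are hypothetical. The sketch assumes the tokens were lowercased, so the phrase 'Machine Learning' is queried as the two words 'machine' and 'learning', and it calls most_similar on model.wv, which is where recent Gensim releases expose it. On a single paragraph the returned neighbours will be close to arbitrary, for the reason given above.

```python
def query_terms(phrase):
    """Lowercase a phrase and split it into single-word query tokens."""
    return phrase.lower().split()


def most_similar_to(sentences, phrase, topn=5):
    """Train a small Word2Vec model and return the words nearest to `phrase`.

    `sentences` must be a re-iterable collection of token lists,
    e.g. [["machine", "learning", "is", ...], ...], not a single string.
    """
    # Imported inside the function so query_terms stays usable without Gensim.
    from gensim.models import Word2Vec

    model = Word2Vec(sentences, min_count=1)
    # Query only with words the model has seen; unknown words raise KeyError.
    known = [term for term in query_terms(phrase) if term in model.wv]
    if not known:
        return []
    # Returns a list of (word, cosine similarity) pairs.
    return model.wv.most_similar(positive=known, topn=topn)
```

Calling `most_similar_to(sentences, "Machine Learning")` with a list of token lists returns up to `topn` (word, score) pairs. Passing both words in `positive` queries around their averaged vectors, which is one simple way to approximate a two-word phrase when the model only knows single words.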
