
Word2Vec in Gensim using model.most_similar

I am new to Word2Vec in Gensim. I want to build a Word2Vec model for the text below (extracted from Wikipedia: Machine Learning) and find the words most similar to 'machine learning'.

My current code is as follows.

# import modules
from gensim.models import Word2Vec

sentences = "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision."
# train word2vec on the sentences
model = Word2Vec(sentences, min_count=1)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, the vocabulary I get back consists of single characters:

['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'r']

Please help me get the most similar words using model.most_similar.

The Word2Vec class expects its corpus of sentences to be an iterable source of individual items which are each a list-of-word-tokens.

You're providing a single string. When Word2Vec iterates over that string, it gets individual characters; when it then tries to interpret each of those characters as a list-of-tokens, it still just gets a single character. So the only 'words' the model ever sees are single characters.
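You can see this directly in plain Python, where iterating over a string yields its characters one at a time:

# iterating over a string yields single characters, not words
text = "Machine learning"
print(list(text)[:8])
# ['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ']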

At the very least, you'd want your corpus to be constructed more like this:

sentences = [
    "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision.".split(),
]

That's still just one 'sentence', but it'll be split-on-whitespace into word-tokens.
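From there you can query most_similar as originally asked. Here's a minimal sketch, assuming a recent gensim (4.x) and using gensim's own simple_preprocess as the tokenizer; note that after tokenization 'machine learning' is two separate tokens, so you query one of them rather than the phrase:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

text = "Machine learning is the subfield of computer science ..."  # the full paragraph quoted above

# crude sentence split on '. ', then lowercase and strip punctuation per sentence
sentences = [simple_preprocess(s) for s in text.split('. ')]

model = Word2Vec(sentences, min_count=1)

# the phrase 'machine learning' becomes the separate tokens 'machine' and 'learning'
print(model.wv.most_similar('learning', topn=5))

With only this single paragraph as training data, the reported neighbours will be essentially noise, for the reason given below.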

Note also that useful word2vec results require large, varied text corpora; toy-sized examples won't usually show the kinds of word similarities or relative word arrangements that word2vec is famous for.
