
Word2Vec in Gensim using model.most_similar

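I am new to Word2Vec in Gensim. I want to build a Word2Vec model for a text (taken from the Wikipedia article on machine learning) and find the words most similar to "machine learning".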

My current code is as follows.

# import modules & set up logging
from gensim.models import Word2Vec

sentences = "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision."
# train word2vec on the sentences
model = Word2Vec(sentences, min_count=1)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, the vocab I get back consists of single characters:

['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'r']

Please help me get the most similar words using model.most_similar.

The Word2Vec class expects its sentences corpus to be an iterable source of individual items, where each item is a list of word tokens.

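You are providing a single string. Iterating over a string yields individual characters. When each of those characters is then interpreted as a list of tokens, it again yields just that one character, so the only "words" the model sees are single characters.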
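The effect can be reproduced without Gensim at all. The following is a minimal sketch in plain Python; the variable names and the shortened sample text are chosen here for illustration only.

```python
text = "Machine learning is the subfield of computer science"

# Iterating over a string yields characters, so every "sentence" is one character.
as_sentences = list(text)
print(as_sentences[:7])  # ['M', 'a', 'c', 'h', 'i', 'n', 'e']

# Iterating over one of those "sentences" yields the same single character again.
print(list(as_sentences[0]))  # ['M']

# Splitting first gives one sentence made of word tokens instead.
corpus = [text.split()]
print(corpus[0][:3])  # ['Machine', 'learning', 'is']
```

This is why the vocabulary in the question contains only letters and the space character.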

At a minimum, you would want your corpus structured more like this:

sentences = [
    "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision.".split(),
]

That is still only one "sentence", but it will be split on whitespace into word tokens.
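Going one step further, the text could be split into several sentences and cleaned before training. The sketch below is one possible approach, not the only one: the helper names `tokenize_corpus` and `most_similar_to_phrase`, the regular expressions, and the shortened sample text are all assumptions made for illustration. The Gensim calls are limited to `Word2Vec(corpus, min_count=1)` from the question and `model.wv.most_similar`, and the import is placed inside the function so the tokenizing part runs even where Gensim is not installed.

```python
import re


def tokenize_corpus(text):
    """Split raw text into sentences, then into lowercase word tokens."""
    # Drop Wikipedia-style citation markers such as [1], [verify] or [5]:2.
    cleaned = re.sub(r"\[[^\]]*\](:\d+)?", "", text)
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    # Keep alphabetic runs only, so digits and punctuation are discarded.
    corpus = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    return [tokens for tokens in corpus if tokens]


def most_similar_to_phrase(corpus, words, topn=5):
    """Train on the corpus and query words close to the given tokens (hypothetical helper)."""
    from gensim.models import Word2Vec  # imported lazily; requires Gensim

    model = Word2Vec(corpus, min_count=1)
    # "machine learning" is two tokens here, so both are passed as positive examples.
    return model.wv.most_similar(positive=words, topn=topn)


text = ("Machine learning explores algorithms that can learn from data[4]. "
        "Samuel coined the term machine learning in 1959 while at IBM.")
corpus = tokenize_corpus(text)
print(corpus)
```

With Gensim installed, `most_similar_to_phrase(corpus, ["machine", "learning"])` would return a list of (word, similarity) pairs. Every queried token must be in the trained vocabulary, otherwise the lookup raises a KeyError.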

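Note also that useful word2vec results require large, varied text samples. Toy-sized examples generally will not show the word similarities or relative word arrangements that word2vec is known for creating.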
