How to load sentences into Python gensim?
I am trying to use the word2vec module from the gensim natural language processing library in Python.
The docs say to initialize the model with:
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
What format does gensim expect for the input sentences? I have raw text:
"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.
What additional processing do I need to do before passing it to word2vec?
UPDATE: Here is what I have tried. When it loads the sentences, I get nothing.
>>> sentences = ['the quick brown fox jumps over the lazy dogs',
"Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split( ) for s in sentences])
>>> x.vocab
{}
A list of utf-8 sentences. You can also stream the data from disk.
Make sure it's utf-8, and split it:
sentences = [ "the quick brown fox jumps over the lazy dogs",
"Then a cop quizzed Mick Jagger's ex-wives briefly." ]
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=5, workers=4)
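A note on the empty vocab in the question: gensim's min_count defaults to 5, so with only two sentences every word falls below the threshold and gets dropped. Here is a rough sketch of that frequency filter using only the standard library. It is not gensim's actual code, and `surviving_vocab` is a hypothetical helper made up for illustration. It also shows the shape the input should have: one list of tokens per sentence.

```python
from collections import Counter

sentences = [
    "the quick brown fox jumps over the lazy dogs",
    "Then a cop quizzed Mick Jagger's ex-wives briefly.",
]

# Word2Vec expects an iterable of token lists, one list per sentence.
tokenized = [s.split() for s in sentences]

# Count how often each token occurs across the whole corpus.
counts = Counter(tok for sent in tokenized for tok in sent)


def surviving_vocab(counts, min_count):
    """Hypothetical helper: tokens a min_count threshold would keep."""
    return {tok for tok, n in counts.items() if n >= min_count}


# No token occurs 5 times in this toy corpus, so nothing survives.
print(surviving_vocab(counts, 5))
# With a threshold of 1 every token is kept.
print(sorted(surviving_vocab(counts, 1)))
```

So on a toy corpus like this you would probably want min_count=1. Also, on Python 3 a str is already unicode, so plain s.split() should be enough there; the .encode('utf-8') call dates from Python 2. In gensim 4 and later the size argument is named vector_size.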
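Like alKid pointed out, make it utf-8.
There are two more things you might have to worry about: input that is too large to hold in memory, and stop words.
Instead of loading a big list into memory, you can do something like this: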
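import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        # Needs the NLTK stopwords corpus: nltk.download('stopwords')
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # Read one line at a time so the whole file never sits in memory
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                # Lowercase, split on whitespace, drop stop words
                yield [w for w in line.lower().split() if w not in self.stop]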
And then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
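The reason for a class with `__iter__` rather than a plain generator is that Word2Vec goes over the corpus more than once (one pass to build the vocabulary, then the training passes), and a generator would be exhausted after the first pass. Below is a minimal sketch that checks this restartable behaviour without gensim or NLTK. The class name `LineSentences`, the tiny stop-word set and the temporary file are all assumptions made up for the demo.

```python
import os
import tempfile


class LineSentences:
    """Restartable iterable: one token list per line of a UTF-8 file."""

    def __init__(self, filename, stop_words=()):
        self.filename = filename
        self.stop_words = set(stop_words)

    def __iter__(self):
        # A fresh file handle on every call, so each pass starts at line one.
        with open(self.filename, encoding="utf-8") as f:
            for line in f:
                tokens = [w for w in line.lower().split()
                          if w not in self.stop_words]
                if tokens:  # skip lines that end up empty
                    yield tokens


# Demo corpus written to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the quick brown fox jumps over the lazy dogs\n")
    tmp.write("Then a cop quizzed Mick Jagger's ex-wives briefly.\n")
    path = tmp.name

corpus = LineSentences(path, stop_words={"the", "a", "over", "then"})
first_pass = list(corpus)
second_pass = list(corpus)  # works again because __iter__ reopens the file
os.remove(path)

print(first_pass)
```

An object like `corpus` can be passed as the sentences argument in the same way as FileToSent above.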