How to shuffle words in word2vec

I have this code:

import gensim
import random


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')

read_data = file.read()

data = read_data.split('\n')

sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')

If I print just one sentence, the output looks like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

What I need is to shuffle the words before training and then save the model.

I am not sure whether I am coding this the right way. I end up with this exception:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
    for sent_idx, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
    for document in self.corpus:
TypeError: 'NoneType' object is not iterable

How can I do the shuffling?

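random.shuffle shuffles the list in place and does not return anything. So after this call, your shuffled_sentences is None, which is what model.train then fails to iterate over.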
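A minimal sketch of that behavior, using a made-up three-sentence corpus (the token names are placeholders, not from the dataset above). It also shows random.sample as one way to get a shuffled copy when a return value is wanted:

```python
import random

# Hypothetical toy corpus, for illustration only
sentences = [["a", "b"], ["c", "d"], ["e", "f"]]

# random.shuffle reorders the list in place and returns None
result = random.shuffle(sentences)
print(result)          # None
print(len(sentences))  # still 3 sentences, just in a new order

# random.sample returns a new shuffled list and leaves the input untouched
shuffled_copy = random.sample(sentences, k=len(sentences))
print(len(shuffled_copy))
```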

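model.build_vocab(sentences)
sentences_list = sentences  # same list object as sentences, not a copy
Idx = range(len(sentences_list))
print(Idx)
for epoch in range(5):
    # shuffle in place; do not assign the result, since random.shuffle returns None
    random.shuffle(sentences)
    # sentences_list is the list that was just shuffled, so this picks up the new order
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)
model.save("somefile.model")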

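This solved my problem.

But how can I shuffle the individual words within a sentence?

Sentence: ['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

My motivation: when I check the most similar words for, say, 'JO_3787672', it always predicts words starting with 'JO_', while words starting with 'TA_' and 'TI_' get much lower similarity scores. I suspect this is because of the word positions in the data (I am not sure). That is why I am trying to shuffle the words within each sentence (I am really not sure whether it will help).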
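One possible way to shuffle the words inside each sentence is sketched below. The helper name shuffle_words and its rng parameter are made up for this illustration, and the two sample sentences are shortened from the one shown above. Whether this changes the similarity scores is not established here.

```python
import random

def shuffle_words(sentences, rng=random):
    """Return a new corpus in which each sentence has its tokens in random order.

    Hypothetical helper: the input sentences are copied, not modified.
    """
    shuffled = []
    for sentence in sentences:
        words = list(sentence)  # copy so the original sentence keeps its order
        rng.shuffle(words)      # in-place shuffle of the copy
        shuffled.append(words)
    return shuffled

# Shortened sample data in the same format as the question
sentences = [
    ['JO_3787672', 'JO_272304', 'TI_2969041', 'TA_954638'],
    ['JO_2027410', 'TI_2509936', 'TA_4321623'],
]

word_shuffled = shuffle_words(sentences)
print(word_shuffled)
```

In the training loop this could replace perm_sentences, for example model.train(shuffle_words(sentences)) in each epoch, written in the same older gensim style as the code above. Newer gensim releases expect extra arguments to train, such as total_examples and epochs, so the call may need adjusting for the installed version.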
