How to shuffle words in Word2Vec

Question

I have this code:

import gensim
import random


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')

read_data = file.read()

data = read_data.split('\n')

sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')

If I print a single sentence, the output looks like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

What I need is to shuffle the words before training and then save the model.

I'm not sure whether I've coded this the right way. I end up with this exception:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
    for sent_idx, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
    for document in self.corpus:
TypeError: 'NoneType' object is not iterable

How should I do the shuffling?

Answer 1

random.shuffle shuffles the list in place and does not return anything. So after this call, your shuffled_sentences is None.
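A minimal sketch of that behavior, using a made-up three-sentence list rather than the asker's data. The variable names here are illustrative only. If a shuffled copy is wanted instead of an in-place shuffle, random.sample with the full length is one option.

```python
import random

# Toy data standing in for the real corpus (assumption, not the asker's file)
sentences = [["JO_1", "TI_1"], ["JO_2", "TA_2"], ["JO_3", "TA_3"]]

# random.shuffle reorders the list in place and returns None
result = random.shuffle(sentences)
print(result)          # None
print(len(sentences))  # the list itself still holds all 3 sentences

# random.sample returns a new shuffled list and leaves the original alone
shuffled_copy = random.sample(sentences, len(sentences))
print(len(shuffled_copy))
```

So inside the training loop, either call random.shuffle(sentences) on its own line and pass sentences to the model, or build a shuffled copy and pass that.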

Answer 2

model.build_vocab(sentences)
sentences_list = sentences  # alias: refers to the same list object as sentences
Idx = range(len(sentences_list))
print(Idx)
for epoch in range(5):
    random.shuffle(sentences)  # shuffles in place, so sentences_list is reordered too
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)
model.save("somefile.model")

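This solved my problem.

But how can I shuffle the individual words within a sentence?

Sentence: ['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

My concern is this: if I check the most similar words for, say, ['JO_3787672'], it always predicts words starting with 'JO_'. Words starting with 'TA_' and 'TI_' get much lower similarity scores. I suspect this is because of the position of the words in the data (I'm not sure). That is why I'm trying to shuffle the words within each sentence (I'm really not sure whether it helps).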
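One way to shuffle the words inside each sentence, sketched under assumptions: shuffle_words is a hypothetical helper, not part of gensim, and the two sentences are shortened toy data. Whether this changes the similarity results is not something this thread establishes.

```python
import random

def shuffle_words(sentences, rng=random):
    """Return a new list in which each sentence's tokens are shuffled.

    Hypothetical helper: the original sentences are left untouched.
    """
    shuffled = []
    for sentence in sentences:
        tokens = list(sentence)  # copy so the original order survives
        rng.shuffle(tokens)      # in-place shuffle of the copy
        shuffled.append(tokens)
    return shuffled

# Toy data in the same shape as the asker's sentences (assumption)
sentences = [
    ['JO_3787672', 'JO_272304', 'TI_2969041', 'TA_954638'],
    ['JO_2027410', 'TI_2509936', 'TA_4321623'],
]

for epoch in range(5):
    random.shuffle(sentences)                   # reorder the sentences in place
    epoch_sentences = shuffle_words(sentences)  # reorder tokens within each sentence
    # model.train(epoch_sentences) would go here, as in the loop above
```

Copying each sentence before shuffling keeps the original token order available, which makes it easy to compare a model trained with word shuffling against one trained without it.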

Answer 1
0 2016-05-08 17:09:23

Answer 2
0 2016-05-09 08:34:03