Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in Python?

I am using the pre-trained Google News vectors to get word vectors with the Gensim library in Python:

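from gensim.models import Word2Vec  # in gensim 1.0+, this loader is KeyedVectors.load_word2vec_format
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)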

After loading the model, I convert the words of my training review sentences into vectors:

# read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# clean the sentences (review_to_wordlist is defined elsewhere)
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
# build one vector per sentence (buildWordVector and n_dim are defined elsewhere)
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

During this step I get a lot of errors for words in my corpus that are not in the model's vocabulary. How can I retrain the already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) so that I get word vectors for those missing words?
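Independently of retraining, the lookup errors themselves can be avoided by skipping words that have no vector. The following is a minimal sketch, assuming that buildWordVector averages the vectors of a sentence's words (the question does not show it). The names build_word_vector and toy_vectors are hypothetical, and a plain dict stands in for the loaded model:

```python
def build_word_vector(words, vectors, n_dim):
    """Average the vectors of the known words, skipping unknown ones."""
    total = [0.0] * n_dim
    known = 0
    for word in words:
        vec = vectors.get(word)  # None when the word is out of vocabulary
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    if known == 0:
        return total  # all-zero vector when no word is known
    return [t / known for t in total]


# Toy 2-dimensional "model": "biryani" is deliberately missing
toy_vectors = {"food": [1.0, 0.0], "rice": [0.0, 1.0]}
print(build_word_vector(["food", "rice", "biryani"], toy_vectors, 2))  # [0.5, 0.5]
```

This only hides the missing words; it does not give them vectors, which is what the retraining question is about.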

Here is what I have tried. First, I trained a new model from my own training sentences:

import gensim

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print("Training model...")
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)

# Passing sentences to the constructor already builds the vocabulary and trains,
# so these two calls repeat that work
model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])

It worked, but the problem is that I have a really small dataset and few resources to train a large model.

The second approach I am looking at is to extend an already trained model such as GoogleNews-vectors-negative300.bin:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)

Is this possible, and is it a good approach? Please help me out.

This is how I technically solved the issue:

Prepare the input data with the sentence iterable from Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/

sentences = MySentences('newcorpus')  
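MySentences is not defined in this answer. The following is a minimal sketch along the lines of the linked tutorial, assuming 'newcorpus' is a directory of plain-text files with one whitespace-tokenized sentence per line. The temporary demo directory and its file name are made up for illustration:

```python
import os
import tempfile


class MySentences(object):
    """Stream a directory of text files as lists of tokens, one sentence per line."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname)) as infile:
                for line in infile:
                    yield line.split()


# Tiny demo corpus in a temporary directory (stands in for 'newcorpus')
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "reviews.txt"), "w") as outfile:
    outfile.write("the food was great\nthe rice was cold\n")

demo_sentences = list(MySentences(demo_dir))
print(demo_sentences)
```

A class with __iter__ is used rather than a one-shot generator because the corpus is iterated more than once (vocabulary scan, then training passes).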

Set up the model:

model = gensim.models.Word2Vec(sentences, size=300)  # size must match the 300-dimensional GoogleNews vectors

Intersect the vocabulary with the Google word vectors:

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)

Finally, run the training to update the model:

model.train(sentences)

A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus.
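To make the intersect step concrete: words that are already in the model's vocabulary take over the vector stored in the file, no new words are added, and lockf controls whether the imported vectors may still change during training. The following is a toy illustration with plain dicts, not gensim's implementation. The function and variable names are hypothetical:

```python
def intersect_vectors(local_vectors, pretrained_vectors, lockf=1.0):
    """Copy pretrained vectors onto words the local vocabulary already has."""
    lock_factors = {}
    for word in local_vectors:
        if word in pretrained_vectors:
            local_vectors[word] = list(pretrained_vectors[word])
            # 0.0 would freeze the imported vector, 1.0 lets training update it
            lock_factors[word] = lockf
        else:
            # words missing from the pretrained file stay fully trainable
            lock_factors[word] = 1.0
    return lock_factors


local = {"food": [0.1, 0.2], "biryani": [0.3, 0.4]}
pretrained = {"food": [0.9, 0.8], "news": [0.5, 0.5]}
locks = intersect_vectors(local, pretrained, lockf=1.0)
print(local)  # "food" now carries the pretrained vector, "news" was not added
print(locks)
```

In this picture, a word such as "biryani" that appears only in the new corpus gets its vector purely from the small corpus, which is where the warning above applies most.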

It is possible if the model builder did not finalize the model training. In Python, finalizing is done with:

model.init_sims(replace=True)  # finalize the model

If the model was not finalized, this is a perfect way to get a model backed by a large dataset.
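For context, gensim's documentation describes init_sims(replace=True) as keeping only the L2-normalized vectors and forgetting the originals, after which training cannot continue. The following is a toy sketch of that in-place normalization with a plain dict. The names are hypothetical and this is not gensim's code:

```python
import math


def unit_normalize_in_place(vectors):
    """Replace each vector by its unit-length version, discarding the original."""
    for word, vec in vectors.items():
        norm = math.sqrt(sum(v * v for v in vec))
        vectors[word] = [v / norm for v in vec]


toy = {"food": [3.0, 4.0]}
unit_normalize_in_place(toy)
print(toy)  # {'food': [0.6, 0.8]}
```

After this step the original length of each vector is gone, which is the sense in which the model is "finalized".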

Some folks have been working on extending gensim to allow online training.

A couple of GitHub pull requests you might want to watch for progress on that effort:

It looks like this improvement could allow updating the GoogleNews-vectors-negative300.bin model.


