
Fine-tuning word2vec on a specific article, using transfer learning

I am trying to fine-tune an existing model on a specific article. I have tried transfer learning using gensim's build_vocab, adding the GloVe vectors (converted with glove2word2vec) to a base model I trained on the article. But build_vocab does not change the base model: it remains very small and no words are added to its vocabulary.

This is the code:

# load the GloVe model

from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors, Word2Vec

glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)

(at this point, len(glove_vectors.wv.vocab) = 40000)

# create the basic model from the article

base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count

(at this point, len(base_model.wv.vocab) = 24)

# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

(at this point, still, len(base_model_good_wv.vocab) = 24)

# training

base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs+5) 
base_model_wv = base_model.wv

I think that base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True) does nothing, so there is no transfer learning. Any recommendations?
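For reference, a minimal check (assuming gensim 3.x, where the vocabulary lives at model.wv.vocab) confirms nothing is added; I suspect that with min_count=5, every GloVe key appears only once in that single pseudo-sentence and so gets trimmed as too rare:

before = len(base_model.wv.vocab)
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
after = len(base_model.wv.vocab)
print(before, after)  # both 24, so no GloVe tokens survived the update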

I relied on this article for guidance...

Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:

  • The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
  • He then tries to improvise a re-use of the GLoVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors, as the sketch after this list illustrates.
  • Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts, with no contribution from the GLoVe values at all. He attributes the improvement to GLoVe, but really multiple mistakes have just cancelled each other out.
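To make that second point concrete, here is a minimal sketch (assuming gensim 3.x, with a tiny hypothetical corpus) showing that a vocabulary update adds tokens but leaves their vectors as fresh random initializations, with nothing copied from any pretrained source:

from gensim.models import Word2Vec

# tiny hypothetical corpus, just to give the model an initial vocabulary
model = Word2Vec([["the", "cat", "sat", "on", "the", "mat"]], size=200, min_count=1)

# add new word-tokens, as the article's build_vocab(update=True) step does
model.build_vocab([["king", "queen"]], update=True)

# the tokens are now in the vocabulary...
print("king" in model.wv.vocab)  # True

# ...but their vectors are just random initializations, not GloVe values
print(model.wv["king"][:5])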

I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.

Further, many who think they want to re-use someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
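As a sketch of that plainer alternative (gensim 3.x assumed; the token lists are hypothetical stand-ins for your article plus extra related texts):

from gensim.models import Word2Vec

# hypothetical stand-ins: the target article, plus extra background texts
article_sentences = [["the", "president", "gave", "a", "speech", "today"]]
background_sentences = [
    ["the", "senate", "held", "a", "vote"],
    ["markets", "reacted", "to", "the", "speech"],
]

# one combined corpus gives a single, evenly-trained model whose vocabulary
# covers both the article's words and the broader background vocabulary
model = Word2Vec(background_sentences + article_sentences, size=200, min_count=1)
print(len(model.wv.vocab))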

There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, tampering with the model between the usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)
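For what it's worth, one building block such experiments sometimes reach for is the experimental intersect_word2vec_format() method (gensim 3.x assumed), which overwrites the vectors of words already in a model's vocabulary with values from an external word2vec-format file, such as the tmp_file produced in the question's first snippet:

from gensim.models import Word2Vec

# hypothetical model whose vocabulary came from your own corpus
model = Word2Vec(size=200, min_count=1)
model.build_vocab([["the", "cat", "sat", "on", "the", "mat"]])

# replace the random vectors of in-vocabulary words with pretrained values;
# words absent from the file keep their random initialization, and lockf=1.0
# leaves the imported vectors free to keep adjusting in later training
model.intersect_word2vec_format(tmp_file, lockf=1.0, binary=False)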

Are you sure you need to do this? What was wrong with vectors trained on just your corpus? Might extending your corpus with extra texts, to expand its vocabulary, work as well or better?

Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like.或者,您可以尝试将有限语料库和 model 中的新域词翻译到与您喜欢的一些较旧的较大预训练向量集相同的坐标空间中。 There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb在使用其实用程序TranslationMatrix class 的 Gensim 演示笔记本中有一个该过程的示例: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
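A rough sketch of that idea, assuming the gensim 3.x TranslationMatrix API shown in the linked notebook (the models and anchor word_pairs here are hypothetical stand-ins):

from gensim.models import Word2Vec, TranslationMatrix

# hypothetical small model trained on your limited corpus
small_model = Word2Vec([["the", "economy", "is", "growing", "fast"]], size=50, min_count=1)

# in practice this would be the larger pretrained KeyedVectors (e.g. GloVe);
# here the same vectors stand in, just to keep the sketch self-contained
big_vectors = small_model.wv

# anchor pairs of words assumed to mean the same thing in both spaces
word_pairs = [("the", "the"), ("economy", "economy"), ("is", "is")]

# learn a linear mapping between the spaces, then project words through it
trans = TranslationMatrix(small_model.wv, big_vectors, word_pairs=word_pairs)
print(trans.translate(["growing"], topn=3))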
