
Fine-tuning word2vec on a specific article, using transfer learning

I'm trying to fine-tune an existing model on a specific article. I have tried transfer learning using gensim's build_vocab, adding glove2word2vec vectors to a base model I trained on the article, but build_vocab does not change the base model: it stays very small and no words are added to its vocabulary.

This is the code:

# load the GloVe model
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors, Word2Vec

glove_file = datapath("/content/glove.6B.200d.txt")
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)

(At this point, len(glove_vectors.wv.vocab) = 40000.)

# create a basic model from the article
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab([tokenizer.tokenize(data.text[0])])
total_examples = base_model.corpus_count

(At this point, len(base_model.wv.vocab) = 24.)

# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

(At this point, still, len(base_model.wv.vocab) = 24.)

# training

base_model.train([tokenizer.tokenize(good_trump.text[0])], total_examples=total_examples, epochs=base_model.epochs+5) 
base_model_wv = base_model.wv

I think that the base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True) call does nothing, so there is no transfer learning. Any recommendations?

I relied on this article for guidance...

Many articles at the 'Towards Data Science' site are very confused, to the point of misleading more than helping. Unfortunately, the article you've linked is a good example:

  • The author first uses an unsupported value (workers=-1) that manages to make his local-corpus training do nothing, and rather than discovering & fixing that error, incorrectly concludes he needs to use 'transfer learning'/'fine-tuning' instead. (He doesn't.)
  • He then tries to improvise a re-use of the GloVe vectors, but as you've noted, his build_vocab() only manages to add the word-tokens to the model's vocabulary. This operation does not copy over any of the actual vectors! (One way vectors can actually be copied is sketched after this list.)
  • Then, by doing training in a model where the default workers=3 was still in effect, he finally does real training on just his own texts, with no contribution from GloVe values at all. He attributes the improvement to GloVe, but really multiple mistakes have just cancelled each other out.
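
For illustration only, here is a minimal sketch of one way to actually copy pretrained vectors into a model, using gensim 3.x's intersect_word2vec_format() method. It only fills in vectors for words already in the model's own vocabulary, and the model's size must match the pretrained file's dimensionality (200 here, not the question's 300). It reuses the question's tokenizer, data, and tmp_file names, and is untested scaffolding rather than a proven fine-tuning recipe:

from gensim.models import Word2Vec

# Dimensionality must match the pretrained file (glove.6B.200d.txt -> 200).
model = Word2Vec(size=200, min_count=5)
model.build_vocab([tokenizer.tokenize(data.text[0])])

# Copy pretrained vectors for any word already in the model's vocabulary;
# lockf=1.0 leaves the imported vectors free to keep adjusting in training.
model.intersect_word2vec_format(tmp_file, binary=False, lockf=1.0)

# Then train as usual on the local texts only.
model.train([tokenizer.tokenize(data.text[0])],
            total_examples=model.corpus_count, epochs=model.epochs)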

I would avoid relying on a 'Towards Data Science' source if any other docs or tutorials are available.

Further, many who think they want to do re-use of someone else's pretrained vectors, with a small update from their own texts, should really just improve their own training corpus, so that they have one unified, evenly-trained model that covers all their needed words.
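
For instance, rather than grafting GloVe onto a 24-word model, one evenly-trained model over a combined corpus might look like the following sketch, assuming extra_texts is a list of token-lists you've gathered from related material (tokenizer and data are the question's own variables):

from gensim.models import Word2Vec

# One combined corpus: the article's tokens plus extra related texts,
# each item a list of tokens.
corpus = [tokenizer.tokenize(data.text[0])] + extra_texts

# A single unified model covering all needed words; workers must be a
# positive thread count (the unsupported workers=-1 silently trains nothing).
model = Word2Vec(corpus, size=200, min_count=5, workers=3)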

There's no explicit support for 'fine-tuning' in Gensim. Bold advanced users can try to cobble it together from other methods, and tampering with the model between usual steps, but I've never seen a well-characterized & evaluated process for doing so. (Lots of the people fumbling through the process aren't even doing a good check of end-quality versus other approaches, just noting some improvement on a few ad hoc, perhaps unrepresentative tests.)

Are you sure you need to do this? What was wrong with the vectors trained on just your corpus? Might extending your corpus with extra texts, to expand its vocabulary, work as well or better?

Or, you could try translating the new domain words from your limited corpus & model into the same coordinate space as some older larger set of pretrained vectors that you like. There's an example of that process in a Gensim demo notebook using its utility TranslationMatrix class: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
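
A rough sketch of that approach, assuming the words shared between your small model and the pretrained set can serve as anchor pairs (the TranslationMatrix calls follow the linked notebook, and 'new_domain_word' is a placeholder for a word of yours):

from gensim.models.translation_matrix import TranslationMatrix

# Words appearing in both spaces serve as anchors for learning the mapping.
shared = [w for w in base_model.wv.vocab if w in glove_vectors.vocab]
word_pairs = [(w, w) for w in shared]

# Learn a linear map from the small model's space into the GloVe space.
transmat = TranslationMatrix(base_model.wv, glove_vectors, word_pairs=word_pairs)
transmat.train(word_pairs)

# Project a word from the small model into GloVe coordinates and see
# which pretrained words land nearby.
print(transmat.translate(["new_domain_word"], topn=5))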
