I am new to the word2vec model. Here is my code; I want to use the sentences list to fine-tune a word2vec model (Gensim 4.1.0):
from gensim.models import Word2Vec, KeyedVectors
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
['this', 'is', 'the', 'second', 'sentence'],
['yet', 'another', 'sentence'],
['one', 'more', 'sentence'],
['and', 'the', 'final', 'sentence']]
# load GoogleNews-vectors-negative300.bin
model = Word2Vec(sentences, vector_size=300, min_count=1, epochs=10)
model.build_vocab(sentences)
total_examples = model.corpus_count
print('total_examples:', total_examples)
model.wv.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
print('success')
model.train(sentences, total_examples=total_examples, epochs=model.epochs)
model.save("word2vec_model1")
I first got the error below:

model.wv.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
  self.vectors_lockf[self.get_index(word)] = lockf  # lock-factor: 0.0=no changes
IndexError: index 12 is out of bounds for axis 0 with size 1
Then I looked into the source code in keyedvectors.py:
if word in self.key_to_index:
    overlap_count += 1
    self.vectors[self.get_index(word)] = weights
    self.vectors_lockf[self.get_index(word)] = lockf  # lock-factor: 0.0=no changes
I used Ctrl+click on "vectors_lockf" to jump to the variable's declaration, but my IDE shows "cannot find declaration to go to". While debugging, I found that vectors_lockf is an ndarray of shape (1,) containing [1.], but I don't know how it's generated, i.e. which code creates it.
This issue did not exist in gensim 3.8, where the vectors_lockf array was initialized internally. After the update in gensim 4, users are expected to initialize this array manually when needed (you can see the comment in the code).
It is initialized in the __init__() method of the Word2Vec class (word2vec.py):
if not hasattr(self, 'wv'):  # set unless subclass already set (eg: FastText)
    self.wv = KeyedVectors(vector_size)
# EXPERIMENTAL lockf feature; create minimal no-op lockf arrays (1 element of 1.0)
# advanced users should directly resize/adjust as desired after any vocab growth
self.wv.vectors_lockf = np.ones(1, dtype=REAL)  # 0.0 values suppress word-backprop-updates; 1.0 allows
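A minimal numpy sketch (standing in for gensim's internals) reproduces the error: the default no-op array has shape (1,), so indexing it with any word index above 0 overruns it.

```python
import numpy as np

# gensim 4's default no-op lockf array: a single element of 1.0
vectors_lockf = np.ones(1, dtype=np.float32)

try:
    vectors_lockf[12] = 1.0  # word at index 12, as in the traceback above
except IndexError as err:
    print(err)  # index 12 is out of bounds for axis 0 with size 1
```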
You can resize this numpy array yourself whenever you build the vocab:
model.build_vocab(sentences)
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=REAL)
You can omit dtype (numpy then defaults to float64), or import the REAL dtype (float32, which gensim uses internally) from gensim.models.word2vec.