
FastText in Gensim

I am using Gensim to load my fasttext .vec file as follows.

from gensim.models import KeyedVectors
m = KeyedVectors.load_word2vec_format(filename, binary=False)

However, I am confused: do I need to load the .bin file to run commands like m.most_similar("dog") , m.wv.syn0 , m.wv.vocab.keys() , etc.? If so, how do I do it?

Or is the .bin file not needed for this cosine-similarity matching?

Please help me!

The gensim library has evolved, so some code fragments have been deprecated. Here is a working solution:

from gensim.models.wrappers.fasttext import FastTextKeyedVectors

model = FastTextKeyedVectors.load_word2vec_format(Source + '.vec', binary=False, encoding='utf8')
word_vectors = model.wv
# This saves memory if you only plan to query the model, not to train it:
del model

# Do your work:
word_vectors.most_similar("etc")

The following can be used:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("path/to/model.vec", binary=False)
model.most_similar("summer")
model.similarity("summer", "winter")

There are many options for using the model now.

If you want to be able to retrain the gensim model later with additional data, you should save the whole model like this: model.save("fasttext.model") . If you save just the word vectors with model.wv.save_word2vec_format(Path("vectors.txt")) , you will still be able to perform any of the functions that vectors provide, such as similarity queries, but you will not be able to retrain the model with more data.

Note that if you are saving the whole model, you should pass a file name as a string instead of wrapping it in get_tmpfile , as suggested in the documentation here .

Maybe I am late in answering this, but you can find the answer in the fastText documentation: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
