I have a function that extracts the pre-trained embeddings from GloVe.txt and loads them as Keras Embedding layer weights, but how can I do the same for the given two files?
This accepted Stack Overflow answer gave me the impression that a .vec file can be treated as a .txt file, and that we might use the same technique to extract the fasttext.vec embeddings that we use for glove.txt. Is my understanding correct?
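As I understand it, that is essentially right, with one caveat: a fasttext .vec file is plain text like GloVe.txt, except that its first line is a header holding the vocabulary size and the vector dimension. A small sketch to illustrate the difference (the helper name and sample lines are made up for illustration):

```python
def is_w2v_header(first_line: str) -> bool:
    """True if the line looks like a word2vec/.vec header: exactly two integers."""
    parts = first_line.split()
    return len(parts) == 2 and all(p.isdigit() for p in parts)

# A .vec file starts with something like "2000000 300"; a GloVe file starts
# directly with a data line such as "the 0.04656 0.21318 ...".
print(is_w2v_header("2000000 300"))          # header line -> skip it
print(is_w2v_header("the 0.04656 0.21318"))  # data line -> parse as word + vector
```

So the same line-by-line parsing works for both formats as long as the header line, when present, is skipped.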
I went through a lot of blogs and Stack Overflow answers to find out what to do with the binary file, and I found in this Stack Overflow answer that the binary (.bin) file is the MODEL itself, not the embeddings, and that you can convert the .bin file to a text file using Gensim. I think that saves the embeddings, and then we can load the pre-trained embeddings just as we load GloVe. Is my understanding correct?
Here is the code to do that. I want to know if I'm on the right path, because I could not find a satisfactory answer to my question anywhere.
from numpy import asarray, zeros
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer.fit_on_texts(data)  # tokenizer is Keras Tokenizer()
vocab_size = len(tokenizer.word_index) + 1  # extra 1 for unknown words
encoded_docs = tokenizer.texts_to_sequences(data)  # data is a list of texts
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  # max_length is, say, 30

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  # this loads the binary Word2Vec model
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)  # this saves the VECTORS to a text file. Can I load it using the function below?
def load_embeddings(vocab_size, fitted_tokenizer, emb_file_path, emb_dim=300):
    '''
    It can load GloVe.txt for sure. But is it the right way to load paragram.txt,
    fasttext.vec and word2vec.bin if converted to .txt?
    '''
    embeddings_index = dict()
    with open(emb_file_path, encoding='utf8', errors='ignore') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = zeros((vocab_size, emb_dim))
    for word, i in fitted_tokenizer.word_index.items():  # use the passed-in tokenizer, not a global
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
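To sanity-check the matrix-building loop above, here is the same logic on toy data (all words and vectors below are made up for illustration). Words missing from the pre-trained embeddings keep their all-zero row, which is the intended behavior:

```python
emb_dim = 3
embeddings_index = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}  # toy vectors
word_index = {"hello": 1, "world": 2, "unseenword": 3}  # like tokenizer.word_index (1-based)
vocab_size = len(word_index) + 1  # index 0 is reserved, hence the +1

# same loop as in load_embeddings, with plain lists instead of numpy arrays
embedding_matrix = [[0.0] * emb_dim for _ in range(vocab_size)]
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

print(embedding_matrix[1])  # vector of "hello"
print(embedding_matrix[3])  # "unseenword" is not in the embeddings, so it stays all zeros
```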
My question is: can we load the .vec file directly, and can we load the .bin file (converted as described above) with the given load_embeddings() function?
I have found the answer to this. Please update if there is any problem:
class PreProcess():
    # check: https://stackabuse.com/pythons-classmethod-and-staticmethod-explained/ for @staticmethod use

    @staticmethod  # you don't have to create an object of this class to access this method: PreProcess.preprocess_data()
    def preprocess_data(data: list, max_length: int):
        '''
        Method to parse, tokenize, build the vocab and pad the text data
        args:
            data: list of all the texts, e.g. ['this is text 1', 'this is text 2 of different length']
            max_length: maximum length to consider for an individual text entry in data
        out:
            vocab size, fitted tokenizer object, encoded input text and padded input text
        '''
        tokenizer = Tokenizer()  # set num_words, oov_token arguments depending on your use case
        tokenizer.fit_on_texts(data)
        vocab_size = len(tokenizer.word_index) + 1  # extra 1 for unknown words, which will be all 0s when loading pre-trained embeddings
        encoded_docs = tokenizer.texts_to_sequences(data)
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
        return vocab_size, tokenizer, encoded_docs, padded_docs

    @staticmethod
    def load_pretrained_embeddings(fitted_tokenizer, vocab_size: int, emb_file: str, emb_dim: int = 300):
        '''
        All 300D Embeddings: https://www.kaggle.com/reppy4620/embeddings
        '''
        if emb_file.endswith('.bin'):  # a binary file is not the embeddings but the MODEL itself; it could be a fasttext or word2vec model
            model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
            # emb_file = emb_file.replace('.bin', '.txt')  # general-purpose path
            emb_file = './new_emb_file.txt'  # for Kaggle, because you have to save data in the output dir only
            model.save_word2vec_format(emb_file, binary=False)

        # open and read the contents of the .txt / .vec file (.vec has the same format as .txt,
        # apart from a "vocab_size emb_dim" header on its first line)
        embeddings_index = dict()
        with open(emb_file, encoding='utf8', errors='ignore') as f:
            for i, line in enumerate(f):  # each line is like: hello 0.9 0.3 0.5 0.01 0.001 ...
                if i == 0 and len(line.split()) == 2:
                    # skip the word2vec-style header; GloVe-style files have none. You'll see
                    # `if len(line) > 100` in most Kaggle kernels for the same reason: there is a
                    # difference between GloVe-style and word2vec-style embeddings.
                    # check this link: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
                    continue
                values = line.split(' ')
                word = values[0]  # first value is "hello"
                coefs = np.asarray(values[1:], dtype='float32')  # everything else is the vector of "hello"
                embeddings_index[word] = coefs

        # create the embedding matrix (Embedding layer weights) based on your data
        embedding_matrix = np.zeros((vocab_size, emb_dim))  # build embeddings based on our vocab size
        for word, i in fitted_tokenizer.word_index.items():  # take each vocab token one by one
            embedding_vector = embeddings_index.get(word)  # look it up in the loaded embeddings
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector  # if present, replace the corresponding zero row
        return embedding_matrix

    @staticmethod
    def load_ELMO(data):
        pass
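The header-skipping step in load_pretrained_embeddings can be exercised without any real embedding file; below, io.StringIO stands in for an open .vec file (the words and numbers are made up):

```python
import io

# .vec style: a "vocab_size dim" header line, then one "word v1 v2 ..." line per word
fake_vec_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

embeddings_index = {}
for i, line in enumerate(fake_vec_file):
    if i == 0 and len(line.split()) == 2:
        continue  # word2vec-style header; a GloVe-style file would have no such line
    values = line.rstrip().split(' ')
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

print(sorted(embeddings_index))   # the header line was skipped, only words remain
print(embeddings_index['hello'])  # the vector parsed for "hello"
```

A GloVe-style file fed through the same loop is unaffected, because its first line has many more than two tokens, so the check is safe for both formats.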