简体   繁体   中英

load pre-trained word embeddings

I want to load pre-trained word embeddings from google news

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print (model.wv.vocab)

But the error is showing:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2022' in position 62425: ordinal not in range(128)

How do I fix this? as I want to list all the words in the word embeddings and do the average for the sentence embedding.

I'm loading them the same way and don't have that problem - I suspect that it's the print statement. Probably your stdout is setup for ascii only, whether it's in jupyter or on a terminal. To avoid that problem, I'd suggest opening a file with encoding like

with open("vocab.txt", "w", encoding="utf8") as vocab_out:
    for word in model.wv.vocab:
        vocab_out.write(word + "\n")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM