
Python Gensim word2vec vocabulary key

I want to build a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus, keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')

pprint.pprint(model.most_similar(u'너'))

Above is my source code. It seems to work. However, there is a problem with the vocabulary keys. I want to build a Korean word2vec model that uses unicode. For example, the word 사과 means apology in English, and its unicode is \xC0AC\xACFC. If I try to look up 사과 in the word2vec model, a KeyError occurs...
Instead of \xc0ac\xacfc, the characters \xc0ac and \xacfc are stored as separate keys. What's the reason, and how can I solve it?

Word2Vec requires text examples that are broken into word-tokens. It appears you are simply providing raw strings to Word2Vec, so when it iterates over them, it sees each single character as a word.

Does Korean use spaces to delimit words? If so, break your texts on spaces before handing each list-of-words to Word2Vec as a text example.

If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.

