Python Gensim word2vec vocabulary key
I want to make word2vec with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import pprint

from gensim.models import Word2Vec

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus, keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')
pprint.pprint(model.most_similar(u'너'))
Above is my source code. It seems to work well. However, there is a problem with the vocabulary keys. I want to build a Korean word2vec model that uses unicode. For example, take the word 사과, which means apology in English; its unicode codepoints are \uC0AC\uACFC. If I try to look up 사과 in the model, a KeyError occurs... Instead of \uc0ac\uacfc, the characters \uc0ac and \uacfc are stored separately. What's the reason, and how can I solve it?
Word2Vec requires text examples that are broken into word-tokens. It appears you are simply providing strings to Word2Vec, so when it iterates over them, it will only see single characters as words.
Does Korean use spaces to delimit words? If so, break your texts on spaces before handing the list-of-words as a text example to Word2Vec.
If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.