Python Gensim word2vec vocabulary key

I want to train a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

# -*- encoding:utf-8 -*-
# !/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus,keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')

pprint.pprint(model.most_similar(u'너'))

Above is my source code. It seems to work well. However, there is a problem with the vocabulary keys. I want to build a Korean word2vec model that uses unicode. For example, the word 사과, which means "apology" in English, is \xC0AC\xACFC in unicode. If I try to look up 사과 in the word2vec model, a key error occurs.
Instead of \xc0ac\xacfc being stored as a single key, \xc0ac and \xacfc are stored separately. What is the reason, and how can I solve it?

Word2Vec requires text examples that are broken into word tokens. It appears you are simply providing strings to Word2Vec, so when it iterates over them, it sees only single characters as words.
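A minimal sketch of that effect in plain Python, with no gensim involved. The sample sentence is made up for illustration, and the snippet assumes Python 3, whereas the question uses Python 2.

```python
# A "sentence" handed over as one plain string is iterated character by character.
line = u"사과 를 먹었다"  # hypothetical sample line, not from the asker's data

as_string_tokens = list(line)   # what a consumer sees when it iterates the raw string
as_word_tokens = line.split()   # a list of word tokens, which is what Word2Vec expects

print(as_string_tokens)
print(as_word_tokens)
```

Only the second form contains 사과 as a single item, which is why the lookup fails when raw strings are passed in.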

Does Korean use spaces to delimit words? If so, break your texts by spaces before handing the list-of-words as a text example to Word2Vec.
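One way the whitespace approach might look, as a sketch rather than a definitive fix. The helper names `tokenize_lines` and `train_word2vec` are made up here, the code assumes Python 3, and the training call assumes gensim 4.x parameter names, which differ from the older release used in the question.

```python
def tokenize_lines(lines):
    """Split each non-empty line on whitespace into a list of word tokens."""
    corpus = []
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            corpus.append(tokens)
    return corpus


def train_word2vec(corpus):
    """Train on a list of token lists (assumes gensim 4.x is installed)."""
    from gensim.models import Word2Vec  # imported lazily; not needed for tokenizing

    # gensim 4.x names the dimensionality `vector_size`; older releases,
    # like the one in the question, call it `size`.
    return Word2Vec(sentences=corpus, vector_size=128, window=5,
                    min_count=5, workers=4)
```

With a file opened as `open('parsed_data.txt', encoding='utf-8')`, passing the file object to `tokenize_lines` and the result to `train_word2vec` should allow lookups such as `model.wv.most_similar('사과')`, provided the word occurs at least `min_count` times in the corpus.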

If not, you'll need to use some external word-tokenizer (not part of gensim) before passing your sentences to Word2Vec.
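A rough sketch of how an external tokenizer could be plugged in before training. The function names are hypothetical, and `simple_tokenizer` is only a regex stand-in: a real Korean morphological analyzer (for example one from a package such as KoNLPy, which is an assumption and not something the answer names) would take its place.

```python
import re


def simple_tokenizer(text):
    """Stand-in tokenizer: runs of word characters, punctuation dropped."""
    return re.findall(r"\w+", text)


def tokenize_corpus(lines, tokenizer):
    """Apply any callable `tokenizer(text) -> list of str` to each line."""
    corpus = []
    for line in lines:
        tokens = tokenizer(line)
        if tokens:  # skip lines that yield no tokens
            corpus.append(tokens)
    return corpus
```

Because `tokenize_corpus` accepts any callable, swapping the stand-in for a proper tokenizer would not change the rest of the pipeline; the resulting list of token lists is what gets passed to Word2Vec.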
