
Word2Vec with Chinese

I have been learning about Word2Vec (Deeplearning4j), but I could not find anything about it supporting Chinese. From various sources I learned that it can also work for Chinese by using some plugin.

So please tell me about any plugin for Chinese, and how it should be implemented with Word2Vec.

Also, is Deeplearning4j's Word2Vec good for supporting both English and Chinese? If not, please suggest a better choice with its link.

Language: Java

I don't know Java, but I can show you how to use Python to do this:

import jieba    # Chinese word segmentation
import gensim

q = [u'我到河北省来', u'好棒好棒哒']             # two tiny example sentences
z = [list(jieba.cut(i)) for i in q]             # segment each sentence into a word list
model = gensim.models.Word2Vec(z, min_count=1)  # min_count=1 keeps every word
model.wv.similar_by_word(u'我')                 # gensim < 4 also allowed model.similar_by_word(...)

The result is not good, since the training data is very small; if you add more data, the result will be better. For your situation, you can use a tokenizer written in Java to do the same work as the jieba library, then feed the correctly formatted data to the model and train it.

Word2vec output is just a dataset of word vectors; in most cases it's a text file in which each line contains a word and its word vector, separated by spaces (or tabs).

You can train this word2vec model in any programming language; loading a text file shouldn't be a problem for you.
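For example, here is a minimal Python sketch of loading such a file into a dictionary (the filename vectors.txt is an assumption; split() without arguments handles both space- and tab-separated lines):

import numpy as np

def load_vectors(path):
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()  # split() handles spaces and tabs
            if len(parts) < 2:
                continue                   # skip header or blank lines
            # first field is the word, the rest are the vector components
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

vectors = load_vectors('vectors.txt')      # hypothetical filename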

For Chinese, I would suggest three tools:

1) The Character-enhanced Word Embedding (CWE) (C++)

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, Huanbo Luan. Joint Learning of Character and Word Embeddings. The 24th International Joint Conference on Artificial Intelligence (IJCAI 2015).

Please note that the output of CWE is separated by tabs (\t).

2) fastText by Facebook (C++)

fastText can train on Chinese; it is built on character n-grams. In my paper:

Aicyber's System for IALP 2016 Shared Task: Character-enhanced Word Vectors and Boosted Neural Networks

I set the minimum character n-gram size to 1 for Chinese (see the sketch after this list).

3) Gensim (Python)

@Howardyan has shown you the code for using gensim, including the tokenizer. Please note that gensim's default training method is CBOW; skip-gram may give you better results depending on your data (see the sketch after this list). A comparison of gensim and fastText is also available online.

PS: Both 1) and 2) also support training the original word2vec.
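As a minimal sketch of points 2) and 3), assuming gensim's Word2Vec and FastText APIs, here is how the skip-gram setting and the character n-gram sizes are passed (the toy sentences stand in for jieba output from the earlier answer):

from gensim.models import Word2Vec, FastText

# tokenized sentences (e.g. jieba output from the earlier answer)
z = [[u'我', u'到', u'河北省', u'来'], [u'好棒', u'好棒', u'哒']]

# sg=1 selects skip-gram instead of gensim's default CBOW
w2v = Word2Vec(z, sg=1, min_count=1)

# FastText builds vectors from character n-grams; min_n=1 keeps single
# Chinese characters as features, matching the setting described above
ft = FastText(z, sg=1, min_count=1, min_n=1, max_n=3)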

As mentioned in other answers, word2vec output is just a set of words with pretrained word vectors, usually for English. Likewise, you can find other datasets that contain Chinese word vectors. I work with Python, but the programming language does not really matter, since what you are looking for is a dataset rather than a model or program.

Here is a Chinese word embedding dataset trained by Tencent AI Lab containing over 8 million Chinese words and phrases: https://ai.tencent.com/ailab/nlp/en/embedding.html
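That dataset is distributed in the standard word2vec text format, so it can be loaded directly with gensim; a minimal sketch (the filename below is an assumption, use whatever the downloaded file is actually called):

from gensim.models import KeyedVectors

# load pretrained vectors in word2vec text format; filename is hypothetical
wv = KeyedVectors.load_word2vec_format('Tencent_AILab_ChineseEmbedding.txt',
                                       binary=False)
print(wv.most_similar(u'中国', topn=5))    # nearest neighbours of a word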

Deeplearning4j can support any language; you just have to implement a custom tokenizer. See https://github.com/deeplearning4j/deeplearning4j-nlp-addons for an example in Japanese.
