Word2Vec with Chinese

I have been learning about Word2Vec (Deeplearning4j), but I could not find anything about it supporting Chinese. From various sources I got to know that it can also work for Chinese by using some plugin.

So please tell me about any plugin for Chinese, and how it should be implemented with Word2Vec.

Also, is Deeplearning4j's Word2Vec a good choice for supporting both English and Chinese? If not, please suggest a better alternative with its link.

Language: Java

I don't know Java, but I can show you how to use Python to do this:

import jieba    # Chinese word segmentation
import gensim

# Two toy Chinese sentences
q = [u'我到河北省来', u'好棒好棒哒']
# Segment each sentence into a list of word tokens with jieba
z = [list(jieba.cut(i)) for i in q]
# Train Word2Vec on the tokenized sentences
model = gensim.models.Word2Vec(z, min_count=1)
# Words most similar to '我' (in gensim >= 1.0 the vectors live on model.wv)
model.wv.similar_by_word(u'我')

The result is not good, since the training data is very, very small; with more data the result will be better. For your case, you can use a tokenizer written in Java that does the same work as the jieba library, then feed the properly formatted data to the model and train it.

A word2vec model is, in the end, just a dataset of word vectors. In most cases it is a text file in which each line contains a word and its vector components, separated by spaces (or tabs).

You can train this word2vec in any programming language, and loading a text file shouldn't be a problem for you.
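To make that concrete, here is a minimal Python sketch of parsing such a file by hand. The file name is hypothetical; the first line of the standard word2vec text format is a header giving the vocabulary size and vector dimensionality.

# A word2vec text file looks roughly like this:
#
#   15000 200
#   的 0.418 0.249 -0.412 ...
#   我 0.680 -0.039 0.301 ...
#
# Minimal loader; 'vectors.txt' is a hypothetical path.
vectors = {}
with open('vectors.txt', encoding='utf-8') as f:
    f.readline()  # skip the "<vocab_size> <dimensions>" header
    for line in f:
        parts = line.rstrip().split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]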

In terms of Chinese, I would suggest three tools:

1) Character-enhanced Word Embedding (CWE, C++)

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, Huanbo Luan. Joint Learning of Character and Word Embeddings. IJCAI 2015.

Please note that the output of CWE is separated by tabs (\t).

2) fastText by Facebook (C++)

fastText can train on Chinese; it is built on character n-grams. In my paper:

Aicyber's System for IALP 2016 Shared Task: Character-enhanced Word Vectors and Boosted Neural Networks

I set the minimum character n-gram to 1 for Chinese.
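If you prefer to stay in Python, gensim also implements the fastText model; here is a minimal sketch (parameter values are illustrative) that sets the minimum character n-gram to 1, as described above:

import jieba
from gensim.models import FastText

sentences = [list(jieba.cut(s)) for s in [u'我到河北省来', u'好棒好棒哒']]
# min_n=1 lets single Chinese characters contribute subword vectors
model = FastText(sentences, min_count=1, min_n=1, max_n=3)
print(model.wv.similar_by_word(u'我'))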

3) Gensim (Python)

@Howardyan has shown you the code for using gensim, including the tokenizer. Please note that gensim's default training method is CBOW; skip-gram may give better results depending on your data. And here is a comparison of gensim and fastText.
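Switching gensim from its CBOW default to skip-gram is a single parameter; a minimal sketch reusing the toy corpus from above:

import jieba
from gensim.models import Word2Vec

z = [list(jieba.cut(s)) for s in [u'我到河北省来', u'好棒好棒哒']]
cbow = Word2Vec(z, min_count=1)            # sg=0 is the default: CBOW
skipgram = Word2Vec(z, min_count=1, sg=1)  # sg=1 selects skip-gram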

PS: Both 1) and 2) also support training the original word2vec.

As mentioned in other answers, word2vec is a set of words with pretrained English word vectors. Likewise, you can find other datasets that contain Chinese word vectors. I work with Python, but I think the programming language does not matter, since what you are looking for is a dataset rather than a model or program.

Here is a Chinese word embedding dataset trained by Tencent AI Lab, containing over 8 million Chinese words and phrases: https://ai.tencent.com/ailab/nlp/en/embedding.html
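Since the Tencent embeddings are distributed as a plain-text file in word2vec format, they can be loaded directly with gensim's KeyedVectors; a minimal sketch (the extracted file name may differ from what is shown here):

from gensim.models import KeyedVectors

# Path to the extracted download; adjust to your local file name.
wv = KeyedVectors.load_word2vec_format('Tencent_AILab_ChineseEmbedding.txt', binary=False)
print(wv.most_similar(u'中国', topn=5))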

Deeplearning4j can support any language. You just have to implement a custom tokenizer. See https://github.com/deeplearning4j/deeplearning4j-nlp-addons for an example in Japanese.
