
word2vec: Order of sentences in the training corpus

I have a question about the word2vec algorithm: does the order of the sentences in the training corpus matter? For example, given two training corpora:

CorpusA: Sentence 1. Sentence 2. Sentence 3.

CorpusB: Sentence 3. Sentence 1. Sentence 2.

Will the results from word2vec be different?

Thanks in advance.

The order of sentences affects the embeddings learned from the text corpus, because most word2vec implementations are trained with stochastic gradient descent (SGD), which updates the vectors one example at a time in the order the examples arrive.

So the answer to your question is yes: the results of word2vec will be different.
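To make the order dependence concrete, here is a minimal sketch of a toy skip-gram model trained with plain SGD in pure Python. It is not how any real word2vec library is implemented: it uses a full softmax instead of negative sampling, and the function `train_skipgram`, its parameters, and the three example sentences are all made up for illustration. The seed is fixed and training is single-threaded, so the only thing that differs between the two runs is the order of the sentences.

```python
import math
import random

def train_skipgram(sentences, dim=8, window=1, lr=0.05, epochs=1, seed=0):
    """Toy skip-gram with a full softmax, trained by plain SGD in corpus order."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    rng = random.Random(seed)  # fixed seed: identical initial vectors on every call
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    for _ in range(epochs):
        for sent in sentences:  # sentences are visited in the order given
            for pos, center in enumerate(sent):
                lo = max(0, pos - window)
                hi = min(len(sent), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    c, o = idx[center], idx[sent[ctx_pos]]
                    v = w_in[c]
                    # Softmax over the whole (tiny) vocabulary.
                    scores = [sum(a * b for a, b in zip(v, u)) for u in w_out]
                    m = max(scores)
                    exps = [math.exp(s - m) for s in scores]
                    z = sum(exps)
                    grad_v = [0.0] * dim
                    for j, u in enumerate(w_out):
                        err = exps[j] / z - (1.0 if j == o else 0.0)
                        for k in range(dim):
                            grad_v[k] += err * u[k]   # uses u[k] before its update
                            u[k] -= lr * err * v[k]   # SGD step on the output vector
                    for k in range(dim):
                        v[k] -= lr * grad_v[k]        # SGD step on the input vector
    return {w: w_in[i] for w, i in idx.items()}

# Hypothetical stand-ins for Sentence 1, Sentence 2 and Sentence 3.
s1 = ["the", "cat", "sat"]
s2 = ["the", "dog", "ran"]
s3 = ["a", "cat", "ran"]

corpus_a = [s1, s2, s3]
corpus_b = [s3, s1, s2]

vec_a = train_skipgram(corpus_a)
vec_b = train_skipgram(corpus_b)

# Largest coordinate-wise gap between the two sets of embeddings.
max_diff = max(abs(x - y) for w in vec_a for x, y in zip(vec_a[w], vec_b[w]))
print("max difference between CorpusA and CorpusB vectors:", max_diff)
```

Both runs start from the same initial vectors and see the same training pairs, yet the final vectors differ, because each SGD step is computed from the vectors left behind by the previous step. Calling `train_skipgram(corpus_a)` twice, by contrast, gives identical vectors in this toy setup.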

I don't think word2vec is the right algorithm to use if the order of sentences in the corpus is important to you. Keep in mind that the output of word2vec can vary for several reasons, a few of which are:

  • random initialisation of vectors
  • negative sampling
  • multi-threading
  • floating-point precision of your machine

For better results, we usually run multiple epochs over the training data, which won't be possible in your case.
