
How to choose the best vector_size for doc2vec?

I am comparing techniques and want to find the best way to vectorize a large number of text documents and reduce their dimensionality. I have already tested Bag of Words and TF-IDF, reducing dimensions with PCA, SVD, and NMF. With these approaches I can reduce my data and pick the number of dimensions based on the explained variance.

However, I want to do the same with doc2vec. Considering that doc2vec is itself a dimensionality reducer, what is the best approach for choosing the number of dimensions for my model? Is there any statistical measure that helps me find the best vector_size?

Thanks in advance!

There's no magic indicator for what's best; you should try a range of dimensionalities to see what scores well on your specific downstream evaluations, given your data & goals.
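As a rough illustration of that sweep, here is a minimal sketch. The names `sweep_vector_size`, `train_doc2vec`, and the `score_fn` callback are hypothetical helpers, not part of gensim; `train_doc2vec` assumes gensim 4.x is installed, and the scoring function stands in for whatever downstream evaluation you actually care about (classification accuracy, clustering quality, retrieval precision, and so on).

```python
from typing import Callable, Dict, Iterable, List


def sweep_vector_size(
    train_fn: Callable[[int], object],
    score_fn: Callable[[object], float],
    candidate_sizes: Iterable[int],
) -> Dict[int, float]:
    """Train one model per candidate size and record its downstream score."""
    scores: Dict[int, float] = {}
    for size in candidate_sizes:
        model = train_fn(size)          # hypothetical: returns a trained model
        scores[size] = score_fn(model)  # hypothetical: higher means better
    return scores


def train_doc2vec(tokenized_docs: List[List[str]], vector_size: int, epochs: int = 20):
    """Train a gensim Doc2Vec model (assumes gensim 4.x is installed)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag each document with its integer position in the corpus.
    corpus = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokenized_docs)]
    # Passing the corpus to the constructor builds the vocabulary and trains.
    return Doc2Vec(corpus, vector_size=vector_size, epochs=epochs, min_count=2, workers=1)
```

Usage might look like `scores = sweep_vector_size(lambda s: train_doc2vec(docs, s), my_downstream_score, [50, 100, 200, 300])`, followed by `max(scores, key=scores.get)`. Here `docs` and `my_downstream_score` are placeholders for your own tokenized corpus and evaluation, and the candidate sizes are arbitrary examples rather than recommendations.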

If you are using a doc2vec implementation that offers inference for documents outside the training set (such as via the .infer_vector() method in the Python gensim library), then a plausible sanity check for eliminating very bad choices of vector_size (or other parameters) is to re-infer vectors for training-set documents.

If repeated re-inferences of the same text are generally "close to" each other, and to the vector for that same document created by the full model training, that's a weak indicator that the model is at least behaving in a self-consistent way. (If the spread of results is large, that might indicate potential problems with insufficient data, too few training epochs, a too-large/overfit model, or other foundational issues.)
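One possible way to put numbers on this check is sketched below. The function `reinference_consistency` is a hypothetical helper, not a gensim API. It assumes a gensim 4.x style model that exposes `infer_vector(tokens)` and `dv[tag]`, and that training documents were tagged with their integer position in the corpus. Cosine similarity is used as the notion of "close to", which is an assumption on my part.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return float(dot / norm) if norm else 0.0


def reinference_consistency(
    model, tokenized_docs: List[List[str]], n_repeats: int = 3
) -> Tuple[float, float]:
    """Return (mean similarity between repeated inferences,
    mean similarity between inferred and trained vectors)."""
    if n_repeats < 2:
        raise ValueError("n_repeats must be at least 2")
    between, to_trained = [], []
    for tag, tokens in enumerate(tokenized_docs):
        # Assumption: document i was trained with tag i.
        inferred = [model.infer_vector(tokens) for _ in range(n_repeats)]
        for i in range(n_repeats):
            to_trained.append(cosine(inferred[i], model.dv[tag]))
            for j in range(i + 1, n_repeats):
                between.append(cosine(inferred[i], inferred[j]))
    return mean(between), mean(to_trained)
```

Running this for each candidate vector_size and comparing the two averages gives a coarse filter only: values well below the others suggest a configuration worth discarding, but high values do not by themselves show that the vectors are useful downstream.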

