How to choose the best vector_size for doc2vec?
I am comparing techniques and want to find the best method to vectorize and reduce the dimensions of a large number of text documents. I have already tested Bag of Words and TF-IDF, and reduced dimensions with PCA, SVD, and NMF. Using these approaches I can reduce my data and pick the best number of dimensions based on the variance explained.
However, I want to do the same with doc2vec. Considering that doc2vec is itself a dimensionality reducer, what is the best approach to find the right number of dimensions for my model? Is there any statistical measure that helps me find the best vector_size?

Thanks in advance!
There's no magic indicator for what's best; you should try a range of dimensionalities to see what scores well on your specific downstream evaluations, given your data & goals.
If you are using a doc2vec implementation that offers inference of out-of-training-set documents (such as via the .infer_vector() method in the Python gensim library), then a plausible sanity check for eliminating very bad choices of vector_size (or other parameters) is to re-infer vectors for the training-set documents.
If repeated re-inferences of the same text are generally "close to" each other, and to the vector for that same document created by the full model training, that's a weak indicator that the model is at least behaving in a self-consistent way. (If the spread of results is large, that might indicate problems such as insufficient data, too few training epochs, a too-large/overfit model, or other foundational issues.)