简体   繁体   中英

Looking to cluster short descriptions of reports. Should I use Word2Vec or Doc2Vec

So, I have close to 2000 reports and each report has an associated short description of the problem. My goal is to cluster all of these so that we can find distinct trends within these reports.

One of the features I'd like to use some sort of contextual text vector. Now, I've used Word2Vec and think this would be a good option but I also so Doc2Vec and I'm not quite sure what would be a better option for this use case.

Any feedback would be greatly appreciated.

They're very similar, so just as with a single approach, you'd try tuning parameters to improve results in some rigorous manner, you should try them both, and compare the results.

Your dataset sounds tiny compared to what either needs to induce good vectors – Word2Vec is best trained on corpuses of many millions to billions of words, while Doc2Vec's published results rely on tens-of-thousands to millions of documents.

If composing some summary-vector-of-the-document from word-vectors, you could potentially leverage word-vectors that are reused from elsewhere, but that will work best if the vectors' original training corpus is similar in vocabulary/domain-language-usage to your corpus. For example, don't expect words trained on formal news writing to work well with, or even cover the same vocabulary as, informal tweets, or vice-versa.

If you had a larger similar-text corpus of documents to train a Doc2Vec model, you could potentially train a good model on the full set of documents, but then just use your small subset, or re-infer vectors for your small subset, and get better results than a model that was only trained on your subset.

Strictly for clustering, and with your current small corpus of short texts, if you have good word-vectors from elsewhere, it may be worth looking at the "Word Mover's Distance" method of calculating pairwise document-to-document similarity. It can be expensive to calculate on larger docs and large document-sets, but might support clustering well.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM