word2vec: user-level, document-level embeddings with pre-trained model

I am currently developing a Twitter content-based recommender system and have a word2vec model pre-trained on 400 million tweets.

How would I go about using those word embeddings to create a document/tweet-level embedding, and then get a user embedding based on the tweets each user has posted?

I was initially intending to average the vectors of the words in a tweet that have a word-vector representation, and then average the document/tweet vectors to get a user vector, but I wasn't sure whether this was optimal or even correct. Any help is much appreciated.

Averaging the vectors of all the words in a short text is one way to get a summary vector for the text. It often works OK as a quick baseline. (And if all you have is word-vectors, it may be your main option.)
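
For concreteness, here's a minimal sketch of that baseline, assuming a gensim KeyedVectors model; the model path and the tokenization are placeholders for your own setup:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to your pre-trained word2vec model.
kv = KeyedVectors.load_word2vec_format("tweets_w2v.bin", binary=True)

def tweet_vector(tokens):
    """Average the vectors of the tokens that appear in the vocabulary."""
    vecs = [kv[w] for w in tokens if w in kv]
    if not vecs:
        # No in-vocabulary words: fall back to a zero vector.
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)
```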

Such a representation might sometimes improve if you did a weighted average based on some other measure of relative term importance (such as TF-IDF), or used raw word-vectors (before normalization to unit length, as pre-normalization raw magnitudes can sometimes hint at strength of meaning).
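
As an illustration of the TF-IDF idea, here's a sketch using scikit-learn's TfidfVectorizer to derive per-term IDF weights, reusing the `kv` model from the sketch above; the corpus is a placeholder, and your tokenization should match the vectorizer's:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["enjoying the game tonight", "great game great crowd"]  # your tweet texts
tfidf = TfidfVectorizer().fit(corpus)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_tweet_vector(tokens):
    """Average word vectors, weighting each word by its IDF."""
    pairs = [(kv[w], idf.get(w, 1.0)) for w in tokens if w in kv]
    if not pairs:
        return np.zeros(kv.vector_size, dtype=np.float32)
    vecs, weights = zip(*pairs)
    return np.average(vecs, axis=0, weights=weights)
```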

You could create user-level vectors by averaging all their texts, or by (roughly equivalently) placing all their authored words into a pseudo-document and averaging all those words together.
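
Both variants are only a few lines; here's a sketch reusing the `tweet_vector` helper above, with `user_tweets` as a hypothetical list of one user's tokenized tweets:

```python
import numpy as np

user_tweets = [["enjoying", "the", "game"], ["great", "match", "tonight"]]

# (a) average the per-tweet vectors
user_vec = np.mean([tweet_vector(t) for t in user_tweets], axis=0)

# (b) roughly equivalently: pool every authored word into one pseudo-document
pseudo_doc = [w for tweet in user_tweets for w in tweet]
user_vec_alt = tweet_vector(pseudo_doc)
```

The two differ slightly when tweets vary in length or in how many of their words are in-vocabulary, since (a) gives each tweet equal weight while (b) gives each word equal weight.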

You might retain more of the variety of a user's posts, especially if their interests span many areas, by first clustering their tweets into N clusters, then modeling the user as the N centroid vectors of the clusters. You might even vary N per user, based on how much they tweet or how wide-ranging their tweets' topics seem to be.
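
A sketch of that clustering variant using scikit-learn's KMeans; the choice of N = 5 is arbitrary here, and `user_tweets` is the same hypothetical list as above:

```python
import numpy as np
from sklearn.cluster import KMeans

tweet_vecs = np.array([tweet_vector(t) for t in user_tweets])
N = min(5, len(tweet_vecs))  # N could instead scale with tweet count or topic spread
kmeans = KMeans(n_clusters=N, n_init=10).fit(tweet_vecs)
user_centroids = kmeans.cluster_centers_  # N vectors jointly modeling this user
```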

With the original tweets, you could also train up per-tweet vectors using an algorithm like 'Paragraph Vector' (aka 'Doc2Vec' in a library like Python's gensim). But that can have challenging RAM requirements with 400 million distinct documents. (If you have a smaller number of users, perhaps they could be the 'documents', or they could be the predicted classes of a FastText-in-classification-mode training session.)
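
A sketch of the users-as-documents variant with gensim's Doc2Vec: tagging every tweet with its author's id trains one vector per user rather than one per tweet, which sidesteps the 400-million-document memory problem. The corpus contents and parameters below are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [
    TaggedDocument(words=["enjoying", "the", "game"], tags=["user_42"]),
    TaggedDocument(words=["great", "match", "tonight"], tags=["user_42"]),
    TaggedDocument(words=["new", "paper", "out"], tags=["user_7"]),
]
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
user_vec = model.dv["user_42"]  # one trained vector per user tag
```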

You are on the right track with averaging the word vectors in a tweet to get a "tweet vector" and then averaging the tweet vectors for each user to get a "user vector". Whether these average vectors will be useful depends on your learning task. It's hard to say whether this averaging method will work without trying it, since it depends on how diverse your data set is in terms of the variation between the words used in each user's tweets.
