word2vec: user-level, document-level embeddings with pre-trained model

I am currently developing a Twitter content-based recommender system and have a word2vec model pre-trained on 400 million tweets.

How would I go about using those word embeddings to create a document/tweet-level embedding, and then get a user embedding based on the tweets each user has posted?

I was initially intending to average the vectors of the words in a tweet that have a word-vector representation, and then average the document/tweet vectors to get a user vector, but I wasn't sure whether this was optimal or even correct. Any help is much appreciated.

Averaging the vectors of all the words in a short text is one way to get a summary vector for the text. It often works OK as a quick baseline. (And if all you have is word-vectors, it may be your main option.)
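
For concreteness, here's a minimal sketch of that baseline, assuming a gensim KeyedVectors model; the model path and the tokenization are placeholders for your own setup:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to your pre-trained word2vec model.
kv = KeyedVectors.load_word2vec_format("tweets_w2v.bin", binary=True)

def tweet_vector(tokens):
    """Average the vectors of the tokens that appear in the vocabulary."""
    vecs = [kv[w] for w in tokens if w in kv]
    if not vecs:
        # No in-vocabulary words: fall back to a zero vector.
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)
```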

Such a representation might sometimes improve if you did a weighted average based on some other measure of relative term importance (such as TF-IDF), or used raw word-vectors (before normalization to unit length, as pre-normalization raw magnitudes can sometimes hint at strength of meaning).
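
As an illustration of the TF-IDF idea, here's a sketch using scikit-learn's TfidfVectorizer to derive per-term IDF weights, reusing the `kv` model from the sketch above; the corpus is a placeholder, and your tokenization should match the vectorizer's:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["enjoying the game tonight", "great game great crowd"]  # your tweet texts
tfidf = TfidfVectorizer().fit(corpus)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_tweet_vector(tokens):
    """Average word vectors, weighting each word by its IDF."""
    pairs = [(kv[w], idf.get(w, 1.0)) for w in tokens if w in kv]
    if not pairs:
        return np.zeros(kv.vector_size, dtype=np.float32)
    vecs, weights = zip(*pairs)
    return np.average(vecs, axis=0, weights=weights)
```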

You could create user-level vectors by averaging all their texts, or by (roughly equivalently) placing all their authored words into a pseudo-document and averaging all those words together.
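
Both variants are only a few lines; here's a sketch reusing the `tweet_vector` helper above, with `user_tweets` as a hypothetical list of one user's tokenized tweets:

```python
import numpy as np

user_tweets = [["enjoying", "the", "game"], ["great", "match", "tonight"]]

# (a) average the per-tweet vectors
user_vec = np.mean([tweet_vector(t) for t in user_tweets], axis=0)

# (b) roughly equivalently: pool every authored word into one pseudo-document
pseudo_doc = [w for tweet in user_tweets for w in tweet]
user_vec_alt = tweet_vector(pseudo_doc)
```

The two differ slightly when tweets vary in length or in how many of their words are in-vocabulary, since (a) gives each tweet equal weight while (b) gives each word equal weight.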

You might retain more of the variety of a user's posts, especially if their interests span many areas, by first clustering their tweets into N clusters, then modeling the user as the N centroid vectors of the clusters. You might even vary N per user, based on how much they tweet or how wide-ranging their tweets' topics seem to be.
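
A sketch of that clustering variant using scikit-learn's KMeans; the choice of N = 5 is arbitrary here, and `user_tweets` is the same hypothetical list as above:

```python
import numpy as np
from sklearn.cluster import KMeans

tweet_vecs = np.array([tweet_vector(t) for t in user_tweets])
N = min(5, len(tweet_vecs))  # N could instead scale with tweet count or topic spread
kmeans = KMeans(n_clusters=N, n_init=10).fit(tweet_vecs)
user_centroids = kmeans.cluster_centers_  # N vectors jointly modeling this user
```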

With the original tweets, you could also train up per-tweet vectors using an algorithm like 'Paragraph Vector' (aka 'Doc2Vec' in a library like Python's gensim). But that can have challenging RAM requirements with 400 million distinct documents. (If you have a smaller number of users, perhaps they could be the 'documents', or they could be the predicted classes of a FastText-in-classification-mode training session.)
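
A sketch of the users-as-documents variant with gensim's Doc2Vec: tagging every tweet with its author's id trains one vector per user rather than one per tweet, which sidesteps the 400-million-document memory problem. The corpus contents and parameters below are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [
    TaggedDocument(words=["enjoying", "the", "game"], tags=["user_42"]),
    TaggedDocument(words=["great", "match", "tonight"], tags=["user_42"]),
    TaggedDocument(words=["new", "paper", "out"], tags=["user_7"]),
]
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
user_vec = model.dv["user_42"]  # one trained vector per user tag
```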

You are on the right track with averaging the word vectors in a tweet to get a "tweet vector" and then averaging the tweet vectors for each user to get a "user vector". Whether these average vectors will be useful depends on your learning task. It's hard to say whether this averaging method will work without trying it, since it depends on how diverse your data set is in terms of the variation between the words used in each user's tweets.
