
How to embed user names in word2vec model in gensim

I have some volunteer essay writings in the format of:

volunteer_names, essay
["emi", "jenne", "john"], [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]
["jenne", "li"], [["lets", "manage", "waste"]]
["emi", "li", "jim"], [["python", "is", "cool"]]
...
...
...

I want to identify similar users based on their essays. I feel that word2vec is well suited to problems like this. However, since I also want to embed the user names in the model, I am not sure how to do it. The examples I found on the internet use only the words (see the example code).

import gensim 
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

Given that, I am wondering whether there is a special way of doing this in word2vec, or whether I can simply treat user names as ordinary words in the model's input. Please let me know your thoughts on this.

I am happy to provide more details if needed.

Word2vec infers word representations from the surrounding words: words that often appear in similar company end up with similar vectors. Usually, a window of 5 words is considered. So, if you want to hack Word2vec, you would need to make sure that the student names appear frequently enough within that window of the essay words (perhaps at the beginning and at the end of each sentence, or something like that).
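A minimal sketch of that hack, assuming the data is held as (names, essay) pairs like the rows in the question. The `USER_` prefix and the helper names `build_sentences` and `train_word2vec` are hypothetical, introduced only for illustration; the prefix keeps a name such as "li" from colliding with an ordinary word. On a corpus this small the resulting vectors are not meaningful, so treat it as a shape for the approach rather than a tuned setup.

```python
# Sample rows copied from the question: (volunteer_names, essay).
records = [
    (["emi", "jenne", "john"],
     [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
    (["emi", "li", "jim"], [["python", "is", "cool"]]),
]

USER_PREFIX = "USER_"  # hypothetical marker, not a gensim convention


def build_sentences(records):
    """Put the prefixed author names at the start and end of every sentence."""
    sentences = []
    for names, essay in records:
        name_tokens = [USER_PREFIX + name for name in names]
        for sentence in essay:
            sentences.append(name_tokens + sentence + name_tokens)
    return sentences


def train_word2vec(sentences):
    """Train Word2Vec on the name-augmented sentences (requires gensim)."""
    # Imported here so build_sentences can be used without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences, window=5, min_count=1, seed=1, workers=1)


# Example usage (requires gensim):
#   model = train_word2vec(build_sentences(records))
#   model.wv.most_similar(USER_PREFIX + "emi")  # nearest tokens, names included
```

Because the names are ordinary tokens here, `most_similar` returns both words and users; filtering the results on the prefix leaves only the users.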

Alternatively, you can have a look at Doc2vec. During training, each document gets an ID, and the model learns an embedding for that ID; these embeddings are stored in a lookup table, just like word embeddings. If you use student names as document IDs, you get student embeddings. If you have multiple essays from one student, the essays should share that student's ID rather than each getting a unique one; in Gensim, document IDs are the tags of a TaggedDocument, and the same tag can be attached to several documents.
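The Doc2vec route could be sketched as follows, assuming each row of the question's data is one essay written jointly by the listed volunteers. The helper names `essays_with_tags` and `train_doc2vec` are hypothetical. Gensim's `TaggedDocument` takes a list of tags, so every author name can be attached to the same essay, and a name that recurs across essays maps to a single embedding. The hyperparameters are placeholders, not recommendations.

```python
# Sample rows copied from the question: (volunteer_names, essay).
records = [
    (["emi", "jenne", "john"],
     [["lets", "protect", "nature"], ["what", "is", "nature"], ["nature", "humans", "earth"]]),
    (["jenne", "li"], [["lets", "manage", "waste"]]),
    (["emi", "li", "jim"], [["python", "is", "cool"]]),
]


def essays_with_tags(records):
    """Flatten each essay into one token list and pair it with its author names."""
    pairs = []
    for names, essay in records:
        words = [token for sentence in essay for token in sentence]
        pairs.append((words, list(names)))
    return pairs


def train_doc2vec(pairs):
    """Train Doc2Vec with the author names as document tags (requires gensim)."""
    # Imported here so essays_with_tags can be used without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    documents = [TaggedDocument(words=words, tags=tags) for words, tags in pairs]
    return Doc2Vec(documents=documents, min_count=1, epochs=40, seed=1, workers=1)


# Example usage (requires gensim):
#   model = train_doc2vec(essays_with_tags(records))
#   model.dv.most_similar("emi")  # gensim 4.x; older releases use model.docvecs
```

Unlike the Word2vec hack, the user vectors live in their own table, separate from the word vectors, so a similarity query on a name returns only other users.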
