
How to use doc2vec to assign labels to enron dataset

I am using the Enron emails dataset. I have to assign each email one of 5 labels: appreciation, escalation, sending_document, request_for_document, meeting_invites. So far, I have used doc2vec to assign labels as follows:

import nltk
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

emails_df['tokenized_sents'] = emails_df.iloc[0:1000].apply(lambda row: nltk.word_tokenize(row['content']), axis=1)

common_texts = [
                ['We', 'were', 'impressed', 'with', 'the', 'work', 'produced', 'by', 'you', 'and', 'you', 'showed', 'leadership', 'qualities', 'that', 'the', 'rest', 'of', 'the', 'team', 'could', 'look', 'up', 'to'],

                ['Finish', 'the', 'financial', 'analysis', 'report', 'that', 'was', 'started', 'last', 'week'],

                ['Please', 'find', 'attached'],

                ['Looking', 'forward', 'to', 'hearing', 'from', 'you'],

                ['The', 'meeting', 'will', 'take', 'place', 'on', 'Wednesday'],

                ['forwarded', 'to', 'xx']
]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
labels = []
#print (documents)

model = Doc2Vec(documents, vector_size=5, window=3, min_count=1, workers=4)  # 'vector_size' is called 'size' in gensim < 4
# Persist the model to disk:
from gensim.test.utils import get_tmpfile
fname = get_tmpfile("my_doc2vec_model")
# print(fname)
# output: C:\Users\userABC\AppData\Local\Temp\my_doc2vec_model

model.save(fname)

# Load the model from the saved file:
model = Doc2Vec.load(fname)
# you can continue training with the loaded model!
# If you're finished training the model (no more updates, only querying), you can reduce memory usage:

model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# Infer a vector for each new document:
import operator

c = 0
for i in emails_df['tokenized_sents']:
    vector = model.infer_vector(i)
    c = c + 1
    index, value = max(enumerate(vector), key=operator.itemgetter(1))
    labels.append(index)

Here, emails_df is simply the dataframe into which I read emails.csv. I don't need a perfect labeler, but I need something worthwhile. Which direction should I go from here to improve it a bit? (Considering this is the first time I have come across doc2vec.)

Edit: Explanation: I have created common_texts as a feature vector that contains sentences belonging to each class. I then apply doc2vec and use its infer_vector function to generate similarities.

Doc2Vec requires a lot of data to train useful "dense embedding" vectors for texts. It's not likely to give good results with just a handful of training texts, as with your 6 short common_texts – even if you reduce the vector_size to just 5 dimensions.

(Published Doc2Vec work often uses tens-of-thousands to millions of training documents, to train doc-vectors with 100-1000 dimensions.)

But then further, these vectors do not have each of their individual dimensions as interpretable categories. Rather, they are "dense" vectors, where there's no a priori assignment of meaning to individual axes. Instead, all training docs are "packed" into a shared space, where their relative distances, or relative directions, may indicate strength-of-relationships.
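To illustrate what "relative distances, or relative directions" means, here's a minimal numpy sketch; the vectors below are made-up placeholders, not real Doc2Vec output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 5-dimensional doc-vectors (placeholders, not real doc-vectors)
doc_a = np.array([0.9, 0.1, -0.3, 0.2, 0.0])
doc_b = np.array([0.8, 0.2, -0.2, 0.1, 0.1])   # points in a similar direction to doc_a
doc_c = np.array([-0.7, 0.5, 0.4, -0.6, 0.2])  # points in a different direction

# The overall direction, not any single dimension, carries the meaning:
print(cosine_similarity(doc_a, doc_b))  # high: likely related documents
print(cosine_similarity(doc_a, doc_c))  # low/negative: likely unrelated documents
```

Note that doc_a's largest dimension tells you nothing by itself; only comparisons between whole vectors are meaningful.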

So, your code, which picks a label for each document based on which dimension of its doc-vector has the largest positive value, is a nonsensical misuse of Doc2Vec-style vectors.

It would help to more clearly state your actual goals: what kind of labels are you trying to assign, and why?

In particular, it would be more appropriate to:

  • train the Doc2Vec model on all the email texts

  • if you have known labels for some of the emails, and want to figure out labels for the other emails, then use the doc-vectors as an input to a separate "classification" step.

  • if you don't have known labels, but want to discover what sorts of natural groupings might exist in the emails, as modeled by Doc2Vec, then you'd use the doc-vectors as an input to a separate "clustering" step – and then further examine/analyze the resulting clusters, to see if they're sensible for your needs or reveal patterns interesting to your project.
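As a concrete sketch of the classification route: assuming you already have doc-vectors plus known labels for a few emails (all data below is a made-up placeholder; real vectors would come from model.dv or model.infer_vector), even a simple nearest-centroid classifier shows the principle of doc-vectors as features:

```python
import numpy as np

# Hypothetical doc-vectors for emails whose labels are already known
# (placeholder 2-d data for illustration; real doc-vectors are higher-dimensional)
labeled_vectors = {
    "appreciation":    [np.array([0.9, 0.1]), np.array([0.8, 0.2])],
    "meeting_invites": [np.array([-0.6, 0.7]), np.array([-0.5, 0.9])],
}

# One centroid (mean doc-vector) per class
centroids = {label: np.mean(vecs, axis=0) for label, vecs in labeled_vectors.items()}

def predict(vector):
    """Assign the label whose class centroid is closest by cosine similarity."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda label: cos(vector, centroids[label]))

print(predict(np.array([0.7, 0.3])))   # lands near the 'appreciation' centroid
```

A real pipeline would more likely feed the doc-vectors into a scikit-learn classifier such as LogisticRegression, but the idea is the same: the Doc2Vec vectors are inputs to a separate, supervised classification step, not labels themselves.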

(There are many online tutorial examples using Python machine-learning tools to classify emails from the Enron dataset. I'd suggest successfully working through one or more of those – even if they don't use Doc2Vec – to understand the general classifier-training, then classifier-testing, and finally classifier-application process. Only then, consider Doc2Vec as an extra source of 'features' to add to the classification effort.)
