
Doc2Vec build_vocab method fails

I am following this guide on building a Doc2Vec gensim model.

I have created an MRE that should highlight this problem:

import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def get_words(para):
    # strip date-like tokens, punctuation and stopwords, then stem what remains
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]

    return stemmed_tokens

data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
  1: 'My waiter was awful, my food was awful, I hate it all',
  2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
  3: 'Even though the cleanliness level was fantastic, my food was awful',
  4: 'Everything was mediocre, but I guess mediocre is better than bad nowadays',
  5: "Honestly there wasn't a single thing that was mediocre about this place",
  6: 'I could not have enjoyed it more! Perfect',
  7: 'This place is perfectly awful. I think it should shut down to be honest',
  8: "I can't understand how anyone would say something negative",
  9: "It killed me. I'm writing this review as a ghost. That's how bad it was."},
 'Bogus Field 1': {0: 'foo71',
  1: 'foo92',
  2: 'foo25',
  3: 'foo88',
  4: 'foo54',
  5: 'foo10',
  6: 'foo48',
  7: 'foo76',
  8: 'foo4',
  9: 'foo11'},
 'Bogus Field 2': {0: 'foo12',
  1: 'foo66',
  2: 'foo94',
  3: 'foo90',
  4: 'foo97',
  5: 'foo87',
  6: 'foo10',
  7: 'foo4',
  8: 'foo16',
  9: 'foo86'},
 'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}    

df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']),
                                                    tags=x['Sentiment']), axis=1,)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])

Which produces:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1069             document_length = len(document.words)
   1070 
-> 1071             for tag in document.tags:
   1072                 _note_doctag(tag, document_length, docvecs)
   1073 

TypeError: 'int' object is not iterable

I do not understand where the int type is coming from, since print(set([type(x) for x in train_tagged])) yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}

Note that additional troubleshooting, such as:

train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])], 
                                                    tags=[x['Sentiment']]), axis=1,)

yields:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1073 
   1074             for word in document.words:
-> 1075                 vocab[word] += 1
   1076             total_words += len(document.words)
   1077 

TypeError: unhashable type: 'list'

Your first attempt is definitely placing a single value where TaggedDocument requires a list of values, even if only a list with one value.

I'm unsure what's wrong in your 2nd attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to ensure that it is (see the quick check sketched after this list):

  • a single TaggedDocument
  • with a tags value that is a list
  • where each item in that list is a simple string (or, in advanced use, an int from a range starting at 0)
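
For instance, a quick check along those lines might look like the following; it's a hypothetical sketch, and the commented values show what your original tags=x['Sentiment'] construction would produce:

doc = train_tagged.iloc[0]   # first element by position (the shuffled index may not contain the label 0)
print(type(doc))             # <class 'gensim.models.doc2vec.TaggedDocument'>
print(doc.words)             # should be a flat list of token strings
print(doc.tags)              # should be a list such as [1]; here it is a bare int, which is what _scan_vocab trips over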

Also note that if train_tagged is the right kind of sequence of TaggedDocument instances, you can and should pass it directly to build_vocab(). (There's no need for the strange [x for x in train_tagged.values] construction.)
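
Putting those points together, a minimal sketch of the corrected tagging step might look like this: keep get_words(...) as the flat list of tokens for words, and wrap only the tag in a list. (In your 2nd attempt, the extra brackets around get_words(...) turned words into a list-of-lists, which is where the "unhashable type: 'list'" comes from.) Whether the Sentiment label is the right tag for your task is a separate question; gensim only requires that tags be a list:

train_tagged = train.apply(
    lambda x: TaggedDocument(words=get_words(x['Review']),   # flat list of tokens
                             tags=[x['Sentiment']]),         # one-element list, not a bare int
    axis=1,
)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab(train_tagged)   # the Series of TaggedDocuments can be passed directly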

More generally, if you're just getting started with Doc2Vec, beginning with the simpler examples in the Gensim docs will work better than material from "Towards Data Science". There's a ton of really awful code and misguided practices on "Towards Data Science".

You are passing no documents to your actual trainer; see the part with

model_dbow = Doc2Vec(dm=0, [...])

This 0 is interpreted as an integer, which is why you get the error. Instead, you should simply add your documents as detailed in the gensim docs for Doc2Vec, and you should be good to go.
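
If you go that route, a minimal sketch of what this answer seems to suggest (assuming the tags have already been fixed to be per-document lists, as discussed above) is to hand the corpus to the constructor via its documents parameter, which builds the vocabulary and trains in one step:

# passing the corpus at construction time; equivalent to creating the model,
# then calling build_vocab() and train() yourself
model_dbow = Doc2Vec(documents=train_tagged, dm=0, vector_size=50,
                     negative=5, hs=0, min_count=1, sample=0, workers=8)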
