I am following this guide on building a Doc2Vec gensim model.
I have created an MRE that should highlight this problem:
import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
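def get_words(para):
    # Raw string so the backslashes reach the regex engine unchanged
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    # Lowercase, split on whitespace, and strip date-like fragments
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    # Remove punctuation from every token
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    # Drop empty or one-character tokens and stop words, then stem the rest
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]
    return stemmed_tokens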
data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
1: 'My waiter was awful, my food was awful, I hate it all',
2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
3: 'Even though the cleanliness level was fantastic, my food was awful',
4: 'Everything was mediocre, but I guess mediocre is better than bad nowadays',
5: "Honestly there wasn't a single thing that was mediocre about this place",
6: 'I could not have enjoyed it more! Perfect',
7: 'This place is perfectly awful. I think it should shut down to be honest',
8: "I can't understand how anyone would say something negative",
9: "It killed me. I'm writing this review as a ghost. That's how bad it was."},
'Bogus Field 1': {0: 'foo71',
1: 'foo92',
2: 'foo25',
3: 'foo88',
4: 'foo54',
5: 'foo10',
6: 'foo48',
7: 'foo76',
8: 'foo4',
9: 'foo11'},
'Bogus Field 2': {0: 'foo12',
1: 'foo66',
2: 'foo94',
3: 'foo90',
4: 'foo97',
5: 'foo87',
6: 'foo10',
7: 'foo4',
8: 'foo16',
9: 'foo86'},
'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}
df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']),
tags=x['Sentiment']), axis=1,)
model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])
Which produces:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
926 total_words, corpus_count = self.vocabulary.scan_vocab(
927 documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928 progress_per=progress_per, trim_rule=trim_rule
929 )
930 self.corpus_count = corpus_count
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
1123 documents = TaggedLineDocument(corpus_file)
1124
-> 1125 total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
1126
1127 logger.info(
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
1069 document_length = len(document.words)
1070
-> 1071 for tag in document.tags:
1072 _note_doctag(tag, document_length, docvecs)
1073
TypeError: 'int' object is not iterable
I do not understand where the int type is coming from, as:
print(set([type(x) for x in train_tagged]))
yields:
{<class 'gensim.models.doc2vec.TaggedDocument'>}
Note, additional troubleshooting such as:
train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])],
tags=[x['Sentiment']]), axis=1,)
yields:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
926 total_words, corpus_count = self.vocabulary.scan_vocab(
927 documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928 progress_per=progress_per, trim_rule=trim_rule
929 )
930 self.corpus_count = corpus_count
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
1123 documents = TaggedLineDocument(corpus_file)
1124
-> 1125 total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
1126
1127 logger.info(
c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
1073
1074 for word in document.words:
-> 1075 vocab[word] += 1
1076 total_words += len(document.words)
1077
TypeError: unhashable type: 'list'
Your first attempt is definitely placing a single value where the TaggedDocument instance requires a list-of-values – even if only a list-with-one-value.
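To illustrate why the first traceback ends in `for tag in document.tags`, here is a minimal sketch that runs without gensim. It assumes only that gensim's TaggedDocument is a namedtuple with the fields `words` and `tags`; the `count_tags` helper is hypothetical and merely imitates the loop shown in the traceback, it is not gensim's code.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (a namedtuple with `words` and `tags`),
# defined here so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def count_tags(documents):
    """Hypothetical helper imitating the traceback's loop over document.tags."""
    seen = {}
    for document in documents:
        for tag in document.tags:  # raises TypeError when tags is a bare int
            seen[tag] = seen.get(tag, 0) + 1
    return seen


# tags=1 mirrors tags=x['Sentiment'] from the question: a single int, not a list.
bad = TaggedDocument(words=["waiter", "aw"], tags=1)
# Wrapping the value in a list gives the loop something to iterate over.
good = TaggedDocument(words=["waiter", "aw"], tags=[1])

try:
    count_tags([bad])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

tag_counts = count_tags([good])
print(error_message)
print(tag_counts)
```

This also suggests why checking `type(x)` for each element reports only TaggedDocument: the container is the right type, and the int sits inside its `tags` field.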
I'm unsure what's wrong in your 2nd attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to ensure that it is a TaggedDocument with a tags value that is a list, whose items are simple strings (or int values from a range starting at 0)?
Also note that if train_tagged is the right kind of sequence-of-TaggedDocument-instances, you can and should pass it directly to build_vocab(). (There's no need for the strange [x for x in train_tagged.values] construction.)
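The inspection suggested above can be sketched as a small checker. This is an illustration only: `describe_problems` is a hypothetical helper, and TaggedDocument is again replaced by an equivalent namedtuple so the snippet runs without gensim. The `words` check is an addition motivated by the second traceback, where `vocab[word] += 1` fails with "unhashable type: 'list'", which is what happens when an item of `words` is itself a list.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (a namedtuple with `words` and `tags`).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def describe_problems(doc):
    """Hypothetical helper: list structural problems in one tagged document."""
    problems = []
    if not isinstance(doc.tags, (list, tuple)):
        problems.append("tags is not a list")
    elif not all(isinstance(tag, (str, int)) for tag in doc.tags):
        problems.append("tags contains something other than str or int")
    if not isinstance(doc.words, (list, tuple)):
        problems.append("words is not a list")
    elif not all(isinstance(word, str) for word in doc.words):
        problems.append("words contains a non-string item")
    return problems


tokens = ["waiter", "aw", "food"]  # what get_words() is expected to return

first_attempt = TaggedDocument(words=tokens, tags=0)       # bare int tag
second_attempt = TaggedDocument(words=[tokens], tags=[0])  # token list wrapped twice
fixed = TaggedDocument(words=tokens, tags=[0])             # flat tokens, list of tags

for name, doc in [("first", first_attempt), ("second", second_attempt), ("fixed", fixed)]:
    print(name, describe_problems(doc))
```

If train_tagged is a pandas Series built from the shuffled training split, `train_tagged.iloc[0]` is probably a safer way to grab a sample than `train_tagged[0]`, since the label 0 may have ended up in the test set.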
More generally, if just getting started with Doc2Vec, beginning with simpler examples in the Gensim docs will work better than things from "Towards Data Science". There's a ton of really-awful code & misguided practices on "Towards Data Science".
You are passing no documents to your actual trainer, see the part with
model_dbow = Doc2Vec(dm=0 , [...])
This 0 is interpreted as an integer, which is why you get the error. Instead, you should simply add your documents as detailed in the gensim docs for Doc2Vec and probably be good to go.