
Doc2Vec build_vocab method fails

I am following this guide to build a gensim Doc2Vec model.


import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def get_words(para):   
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'  # raw string so \d is not treated as a string escape
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]
    return stemmed_tokens

data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
  1: 'My waiter was awful, my food was awful, I hate it all',
  2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
  3: 'Even though the cleanliness level was fantastic, my food was awful',
  4: 'Everything was mediocre, but I guess mediocre is better than bad nowadays',
  5: "Honestly there wasn't a single thing that was mediocre about this place",
  6: 'I could not have enjoyed it more! Perfect',
  7: 'This place is perfectly awful. I think it should shut down to be honest',
  8: "I can't understand how anyone would say something negative",
  9: "It killed me. I'm writing this review as a ghost. That's how bad it was."},
 'Bogus Field 1': {0: 'foo71',
  1: 'foo92',
  2: 'foo25',
  3: 'foo88',
  4: 'foo54',
  5: 'foo10',
  6: 'foo48',
  7: 'foo76',
  8: 'foo4',
  9: 'foo11'},
 'Bogus Field 2': {0: 'foo12',
  1: 'foo66',
  2: 'foo94',
  3: 'foo90',
  4: 'foo97',
  5: 'foo87',
  6: 'foo10',
  7: 'foo4',
  8: 'foo16',
  9: 'foo86'},
 'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}    

df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']), 
                                                    tags=x['Sentiment']), axis=1,)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])


TypeError                                 Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1069             document_length = len(document.words)
-> 1071             for tag in document.tags:
   1072                 _note_doctag(tag, document_length, docvecs)

TypeError: 'int' object is not iterable

I don't understand where the int type is coming from, because print(set([type(x) for x in train_tagged])) yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}


train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])], 
                                                    tags=[x['Sentiment']]), axis=1,)


TypeError                                 Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1074             for word in document.words:
-> 1075                 vocab[word] += 1
   1076             total_words += len(document.words)

TypeError: unhashable type: 'list'


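I'm not sure what's wrong with your second attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to make sure it is:

  • a single TaggedDocument
  • with a tags value that is a list
  • where each item in that list is a simple string (or, in advanced usage, an int from a range starting at 0)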
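
The three checks above can be sketched as a small helper. This is only an illustration: describe_problems and the sample tokens are hypothetical names made up here, not part of gensim, and the fallback namedtuple is a stand-in with the same two fields (words, tags) so the snippet still runs where gensim is not installed. It also checks that words is a flat list of strings, which is an extra check beyond the list above.

```python
import numbers
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only if gensim is unavailable.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def describe_problems(doc):
    """Return a list of shape problems found in doc (empty if none)."""
    if not isinstance(doc, TaggedDocument):
        return ["not a TaggedDocument: %s" % type(doc).__name__]
    problems = []
    # tags must be a list whose items are strings (or plain ints).
    if not isinstance(doc.tags, (list, tuple)):
        problems.append("tags is %s, expected a list" % type(doc.tags).__name__)
    elif not all(isinstance(t, (str, numbers.Integral)) for t in doc.tags):
        problems.append("tags contains items that are neither str nor int")
    # words must be a flat list of string tokens.
    if not isinstance(doc.words, (list, tuple)):
        problems.append("words is %s, expected a list" % type(doc.words).__name__)
    elif not all(isinstance(w, str) for w in doc.words):
        problems.append("words contains non-string items (possibly a nested list)")
    return problems


tokens = ["waiter", "aw", "food", "aw", "hate"]  # hypothetical stemmed tokens

first_attempt = TaggedDocument(words=tokens, tags=0)       # bare int as tags
second_attempt = TaggedDocument(words=[tokens], tags=[0])  # token list wrapped in another list
fixed = TaggedDocument(words=tokens, tags=["0"])           # flat tokens, list of string tags

for name, doc in [("first", first_attempt), ("second", second_attempt), ("fixed", fixed)]:
    print(name, describe_problems(doc))
```

Under these assumptions, the first shape matches the "'int' object is not iterable" traceback (tags is a bare int) and the second matches "unhashable type: 'list'" (each "word" is itself a list).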

Also note that if train_tagged is a proper sequence of TaggedDocument objects, you can and should pass it directly to build_vocab() (there is no need for the odd [x for x in train_tagged.values] construct).

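More generally, if you are just getting started with Doc2Vec, you will be better off beginning with the simpler examples in the Gensim documentation than with anything from "Towards Data Science". There is a lot of very bad code and misguided practice on "Towards Data Science".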


model_dbow = Doc2Vec(dm=0 , [...])

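The 0 is interpreted as an integer, which is why you are getting the error. Instead, you should simply add your documents as detailed in the gensim documentation for Doc2Vec, and you should be good to go.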


