
Doc2Vec build_vocab method fails

I am following this guide to build a Doc2Vec gensim model.

I created an MRE to highlight the problem:

import pandas as pd, numpy as np, warnings, nltk, string, re, gensim
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def get_words(para):   
    pattern = r'([\d]|[\d][\d])\/([\d]|[\d][\d]\/([\d]{4}))'
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    no_dates = [re.sub(pattern, '', i) for i in para.lower().split()]
    no_punctuation = [nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in no_dates]
    stemmed_tokens = [stemmer.stem(word) for word in no_punctuation if word.strip() and len(word) > 1 and word not in stop_words]
    
    return stemmed_tokens

data_dict = {'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
 'Review': {0: "Even though the restauraunt was gross, the food was still good and I'd recommend it",
  1: 'My waiter was awful, my food was awful, I hate it all',
  2: 'I did not enjoy the food very much but I thought the waitstaff was fantastic',
  3: 'Even though the cleanliness level was fantastic, my food was awful',
  4: 'Everything was mediocre, but I guess mediocre is better than bad nowadays',
  5: "Honestly there wasn't a single thing that was mediocre about this place",
  6: 'I could not have enjoyed it more! Perfect',
  7: 'This place is perfectly awful. I think it should shut down to be honest',
  8: "I can't understand how anyone would say something negative",
  9: "It killed me. I'm writing this review as a ghost. That's how bad it was."},
 'Bogus Field 1': {0: 'foo71',
  1: 'foo92',
  2: 'foo25',
  3: 'foo88',
  4: 'foo54',
  5: 'foo10',
  6: 'foo48',
  7: 'foo76',
  8: 'foo4',
  9: 'foo11'},
 'Bogus Field 2': {0: 'foo12',
  1: 'foo66',
  2: 'foo94',
  3: 'foo90',
  4: 'foo97',
  5: 'foo87',
  6: 'foo10',
  7: 'foo4',
  8: 'foo16',
  9: 'foo86'},
 'Sentiment': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}}    

df = pd.DataFrame(data_dict, columns=data_dict.keys())
train, test = train_test_split(df, test_size=0.3, random_state=8)
train_tagged = train.apply(lambda x: TaggedDocument(words=get_words(x['Review']), 
                                                    tags=x['Sentiment']), axis=1,)

model_dbow = Doc2Vec(dm=0, vector_size=50, negative=5, hs=0, min_count=1, sample=0, workers=8)
model_dbow.build_vocab([x for x in train_tagged.values])

This produces:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-590096b99bf9> in <module>
----> 1 model_dbow.build_vocab([x for x in train_tagged.values])

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1069             document_length = len(document.words)
   1070 
-> 1071             for tag in document.tags:
   1072                 _note_doctag(tag, document_length, docvecs)
   1073 

TypeError: 'int' object is not iterable

I don't understand where the int type is coming from, because print(set([type(x) for x in train_tagged])) yields: {<class 'gensim.models.doc2vec.TaggedDocument'>}

Note that other troubleshooting attempts, such as:

train_tagged = train.apply(lambda x: TaggedDocument(words=[get_words(x['Review'])], 
                                                    tags=[x['Sentiment']]), axis=1,)

yield:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-7bd5804d8d95> in <module>
----> 1 model_dbow.build_vocab(train_tagged)

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in build_vocab(self, documents, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    926         total_words, corpus_count = self.vocabulary.scan_vocab(
    927             documents=documents, corpus_file=corpus_file, docvecs=self.docvecs,
--> 928             progress_per=progress_per, trim_rule=trim_rule
    929         )
    930         self.corpus_count = corpus_count

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in scan_vocab(self, documents, corpus_file, docvecs, progress_per, trim_rule)
   1123             documents = TaggedLineDocument(corpus_file)
   1124 
-> 1125         total_words, corpus_count = self._scan_vocab(documents, docvecs, progress_per, trim_rule)
   1126 
   1127         logger.info(

c:\python367-64\lib\site-packages\gensim\models\doc2vec.py in _scan_vocab(self, documents, docvecs, progress_per, trim_rule)
   1073 
   1074             for word in document.words:
-> 1075                 vocab[word] += 1
   1076             total_words += len(document.words)
   1077 

TypeError: unhashable type: 'list'

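Your first attempt definitely supplies a single value where a TaggedDocument instance needs a list of values for tags, even if it is a list containing just one value.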
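To see why a bare value fails, here is a minimal sketch that mimics the `for tag in document.tags` loop shown in the first traceback. `FakeTaggedDocument` and `count_tags` are hypothetical stand-ins written for illustration, not gensim's actual code.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two field names as gensim's TaggedDocument.
FakeTaggedDocument = namedtuple("FakeTaggedDocument", "words tags")


def count_tags(documents):
    """Iterate over every document's tags, as the loop in the traceback does."""
    seen = 0
    for document in documents:
        for tag in document.tags:  # raises TypeError if tags is a bare int
            seen += 1
    return seen


bad = [FakeTaggedDocument(words=["food", "good"], tags=1)]     # bare int
good = [FakeTaggedDocument(words=["food", "good"], tags=[1])]  # one-item list

try:
    count_tags(bad)
    bad_error = None
except TypeError as exc:
    bad_error = str(exc)

print(bad_error)         # the same kind of message as in the question
print(count_tags(good))  # one tag counted
```

Each element is still a TaggedDocument in both cases, which is why checking only the type of the elements does not reveal the problem.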

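I'm not sure what is wrong with your second attempt, but have you looked at a representative instance of train_tagged, for example train_tagged[0], to make sure that it is:

  • a single TaggedDocument

  • with a tags value that is a list
  • where each item in that list is a simple string (or, in advanced usage, an int in a range starting from 0)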
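The checks in the list above could be automated with a small helper like the one sketched here. `Doc` and `describe_problems` are hypothetical names, and `Doc` is only a stand-in with the same field names as TaggedDocument. The extra check on `words` is my addition: the second traceback fails at `vocab[word] += 1`, which is consistent with `words=[get_words(...)]` wrapping the token list inside another list.

```python
import numbers
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (same field names).
Doc = namedtuple("Doc", "words tags")


def describe_problems(document):
    """Return a list of problems found; an empty list means the document looks fine."""
    if not (hasattr(document, "words") and hasattr(document, "tags")):
        return ["not a TaggedDocument-like object"]

    problems = []
    if not isinstance(document.tags, list):
        problems.append("tags is not a list")
    elif not all(isinstance(t, (str, numbers.Integral)) for t in document.tags):
        problems.append("a tag is neither str nor int")

    # Extra check (an assumption beyond the list above): every word must be a
    # plain string, otherwise it cannot be used as a dictionary key.
    words_ok = isinstance(document.words, list) and all(
        isinstance(w, str) for w in document.words
    )
    if not words_ok:
        problems.append("words is not a flat list of str")
    return problems


print(describe_problems(Doc(["food", "good"], [1])))    # no problems
print(describe_problems(Doc(["food", "good"], 1)))      # like the first attempt
print(describe_problems(Doc([["food", "good"]], [1])))  # like the second attempt
```

Running the helper on train_tagged[0] from each attempt should point at the field that has the wrong shape.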

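Also note that if train_tagged is a proper sequence of TaggedDocument objects, you can and should pass it directly to build_vocab() (there is no need for the odd [x for x in train_tagged.values] construct).

More generally, if you are just starting out with Doc2Vec, you would do better to start from the simpler examples in the Gensim documentation than from anything on "Towards Data Science". There is a large amount of really bad code and misguided practice on "Towards Data Science".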

You are not passing any documents to your actual trainer; see the part

model_dbow = Doc2Vec(dm=0 , [...])

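The 0 is interpreted as an integer, which is why you are receiving the error. Instead, you should simply add your documents as detailed in the gensim documentation for Doc2Vec, and you should be good to go.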

