
topic modeling error (doc2bow expects an array of unicode tokens on input, not a single string)

from nltk.tokenize import RegexpTokenizer
#from stop_words import get_stop_words
from gensim import corpora, models 
import gensim
import os
from os import path
from time import sleep

filename_2 = "buisness1.txt"
file1 = open(filename_2, encoding='utf-8')  
Reader = file1.read()
tdm = []

# Tokenized the text to individual terms and created the stop list
tokens = Reader.split()
#insert stopwords files
stopwordfile = open("StopWords.txt", encoding='utf-8')  

# Use this to read file content as a stream  
readstopword = stopwordfile.read() 
stop_words = readstopword.split() 

for r in tokens:  
    if not r in stop_words: 
        #stopped_tokens = [i for i in tokens if not i in en_stop]
        tdm.append(r)

dictionary = corpora.Dictionary(tdm)
corpus = [dictionary.doc2bow(i) for i in tdm]
sleep(3)
#Implemented the LdaModel
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary)
print(ldamodel.print_topics(num_topics=1, num_words=1))

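I am trying to remove stop words using a separate txt file that contains the stop words. After removing them, I append every word from the text file that is not in the stop-word list to tdm. I get the error "doc2bow expects an array of unicode tokens on input, not a single string" at the line dictionary = corpora.Dictionary(tdm).

Can anyone help me correct my code?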

This is almost certainly a duplicate, but use this instead:

dictionary = corpora.Dictionary([tdm])
