Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string
from nltk.tokenize import RegexpTokenizer
#from stop_words import get_stop_words
from gensim import corpora, models
import gensim
import os
from os import path
from time import sleep
filename_2 = "buisness1.txt"
file1 = open(filename_2, encoding='utf-8')
Reader = file1.read()
tdm = []
# Tokenized the text to individual terms and created the stop list
tokens = Reader.split()
#insert stopwords files
stopwordfile = open("StopWords.txt", encoding='utf-8')
# Use this to read file content as a stream
readstopword = stopwordfile.read()
stop_words = readstopword.split()
for r in tokens:
    if r not in stop_words:
        #stopped_tokens = [i for i in tokens if not i in en_stop]
        tdm.append(r)
dictionary = corpora.Dictionary(tdm)
corpus = [dictionary.doc2bow(i) for i in tdm]
sleep(3)
#Implemented the LdaModel
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary)
print(ldamodel.print_topics(num_topics=1, num_words=1))
I am trying to remove stop words using a separate txt file that contains the stop words. After removing the stop words, I append the words from the text file that are not in the stop list. I get the error doc2bow expects an array of unicode tokens on input, not a single string at the line
dictionary = corpora.Dictionary(tdm)
Can anyone help me correct my code?
This is almost certainly a duplicate, but use this instead:
dictionary = corpora.Dictionary([tdm])