
Getting "doc2bow expects an array of unicode tokens on input, not a single string" when trying to do NLP with gensim. Is there a solution?

import gensim
from gensim import corpora

LDA = gensim.models.ldamodel.LdaModel
dictionary = corpora.Dictionary(docCleaned)  # Error message appears here!!!
doc_term_matrix = [dictionary.doc2bow(doc) for doc in docCleaned]

Error message:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

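corpora.Dictionary expects an iterable of documents, where each document is a list of string tokens. Here each "document" it receives is a plain string rather than a list of tokens, so the internal call to doc2bow raises the TypeError.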

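You may want to split the text into "documents" and then split each document into tokens. How to do that depends on the nature of your text. In the worst case, when docCleaned is one long string, you can split it on punctuation and then split each piece on whitespace:

import re
import string
from gensim import corpora
# Split on punctuation into "documents", then split each document into word tokens.
pieces = re.split('[' + re.escape(string.punctuation) + ']', docCleaned)
dictionary = corpora.Dictionary([p.split() for p in pieces if p.strip()])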
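If you are not sure what shape docCleaned has, a small normalising step may help. The sketch below is only an illustration: the helper name to_token_lists is hypothetical (it is not part of gensim), and the lowercase regex tokenisation is an assumption that you would likely replace with your own cleaning step. It accepts a single string, a list of strings, or a list of token lists, and always returns a list of token lists.

```python
import re


def to_token_lists(docs):
    """Return a list of token lists (one list of str per document).

    Hypothetical helper, not part of gensim. Accepts a single string,
    a list of strings, or a list of already tokenised documents.
    """
    if isinstance(docs, str):
        # A single string is treated as one document.
        docs = [docs]
    token_lists = []
    for doc in docs:
        if isinstance(doc, str):
            # Assumed tokenisation: lowercase, keep runs of word characters.
            tokens = re.findall(r"\w+", doc.lower())
        else:
            # Already a sequence of tokens: just make sure each one is a str.
            tokens = [str(tok) for tok in doc]
        token_lists.append(tokens)
    return token_lists


sample = ["Human machine interface", "Graph minors survey"]
tokenized = to_token_lists(sample)
print(tokenized)
```

The result can then be passed to corpora.Dictionary(tokenized), and each token list to dictionary.doc2bow(tokens), since both receive lists of tokens rather than raw strings.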
