Getting "doc2bow expects an array of unicode tokens on input, not a single string" when trying to do NLP using gensim. Is there a solution?

Question

import gensim
from gensim import corpora

LDA = gensim.models.ldamodel.LdaModel
dictionary = corpora.Dictionary(docCleaned)  # Error message appears here!!!
doc_term_matrix = [dictionary.doc2bow(doc) for doc in docCleaned]

Error Message ->

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Answer 1

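corpora.Dictionary expects an iterable of documents, where each document is a list of string tokens. The error means that at least one "document" it received is a plain string rather than a list of tokens.

If docCleaned is a single string, split it into "documents" first and then split each document into tokens. How to split depends on the nature of your text. In the simplest case, you can split on punctuation to get the documents and on whitespace to get the tokens:

import re
import string

# Split on punctuation to get documents, then on whitespace to get tokens.
pieces = re.split('[' + re.escape(string.punctuation) + ']', docCleaned)
texts = [piece.split() for piece in pieces if piece.strip()]
dictionary = corpora.Dictionary(texts)

Pass the same token lists to doc2bow, i.e. iterate over texts rather than docCleaned when building doc_term_matrix.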
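
As a rough sketch of the input shape gensim expects, the helper below normalises either a single string or a list of strings into a list of token lists, using only the standard library. The name to_token_lists is hypothetical and not part of gensim, and the sketch assumes that punctuation separates documents and whitespace separates tokens, which may not suit your data.

```python
import re
import string

# Character class matching any ASCII punctuation character.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def to_token_lists(doc_cleaned):
    """Hypothetical helper: return a list of documents, each a list of string tokens."""
    if isinstance(doc_cleaned, str):
        # One big string: treat each punctuation-delimited piece as a document.
        doc_cleaned = _PUNCT.split(doc_cleaned)
    texts = []
    for doc in doc_cleaned:
        # A document that is still a plain string is split on whitespace;
        # a document that is already a sequence of tokens is copied into a list.
        tokens = doc.split() if isinstance(doc, str) else list(doc)
        if tokens:  # skip empty documents
            texts.append(tokens)
    return texts


texts = to_token_lists("human machine interface. graph of trees, user survey")
print(texts)
```

The result can then be passed to corpora.Dictionary(texts), and each element of texts to dictionary.doc2bow(tokens).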

solution1
0 2021-01-26 07:56:46
