TypeError: doc2bow expects an array of unicode tokens on input, not a single string, when using gensim.corpora.Dictionary()

Question

I have a DataFrame like this:

  index  terms   
  1345  ['jays', 'place', 'great', 'subway']    
  1543  ['described', 'communicative', 'friendly']    
  9874  ['great', 'sarahs', 'apartament', 'back']    
  2456  ['great', 'sarahs', 'apartament', 'back']

I am trying to create a dictionary from the corpus in comments['terms'], but I get an error message:

from gensim import corpora, models
dictionary = corpora.Dictionary( comments['terms'] )

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Answer 1

Each row's terms need to be in their own sublist, and all of the sublists must be nested within one outer list.

theterms = [['jays', 'place', 'great', 'subway'],['described', 'communicative', 'friendly'], ['great', 'sarahs', 'apartament', 'back'],['great', 'sarahs', 'apartament', 'back']] 

dictionary = corpora.Dictionary(theterms)
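
gensim raises this TypeError when a document is a plain string rather than a list of tokens. The following is a minimal sketch for locating such entries before building the dictionary. `find_string_documents` is a hypothetical helper, not part of gensim, and the sample data is made up to mirror the question.

```python
def find_string_documents(documents):
    """Return (position, value) pairs for documents that are plain strings.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    return [(i, doc) for i, doc in enumerate(documents) if isinstance(doc, str)]


# Illustrative data: the second entry is a string that only looks like a list.
sample = [
    ['jays', 'place', 'great', 'subway'],
    "['described', 'communicative', 'friendly']",
]
bad = find_string_documents(sample)
print(bad)
```

An empty result suggests every document is already a token list; any pair it returns points at an entry that would likely trigger the error.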

Answer 2

First convert comments['terms'] to a list with comments['terms'].tolist(), then pass that list to corpora.Dictionary(); it should work. You can also do other preprocessing, such as stemming or stopword removal, before creating the dictionary.
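
If the column was loaded from a CSV file, each cell may be the string form of a list rather than a real list, in which case the converted list would still contain strings. The sketch below shows one way to normalize the cells, assuming that is what happened here (the question does not say how the DataFrame was built). `to_token_lists` is a hypothetical helper, and the sample cells are illustrative.

```python
import ast


def to_token_lists(cells):
    """Convert each cell to a list of string tokens.

    Hypothetical helper. A cell that is a string is assumed to hold a
    Python list literal such as "['great', 'back']" and is parsed with
    ast.literal_eval; a cell that is already a sequence is copied.
    """
    documents = []
    for cell in cells:
        if isinstance(cell, str):
            # Raises ValueError or SyntaxError if the string is not a literal.
            cell = ast.literal_eval(cell)
        documents.append([str(token) for token in cell])
    return documents


# One stringified list and one real list, as might come out of a mixed column.
cells = ["['great', 'sarahs', 'apartament', 'back']", ['jays', 'place']]
documents = to_token_lists(cells)
print(documents)
```

With pandas and gensim available, the result could then be used as `corpora.Dictionary(to_token_lists(comments['terms'].tolist()))`.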

2 answers

Answer 1
1 2017-08-10 17:52:22

Answer 2
0 2017-08-18 16:32:07