I use the word_tokenize function on my dataframe, calling the column name more_clean to build word_dict, but executing it raises the error "expected string or bytes-like object".

This is my dataframe:

And this is my code:
from nltk.tokenize import word_tokenize

word_dict = {}
for i in range(0, len(df['more_clean'])):
    sentence = df['more_clean'][i]
    word_token = word_tokenize(sentence)
    for j in word_token:
        if j not in word_dict:
            word_dict[j] = 1
        else:
            word_dict[j] += 1
and an error message appears like this:
TypeError: expected string or bytes-like object
You need to make sure the sentence variable is of type str:
word_token = word_tokenize(str(sentence))
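This error typically appears when the column contains a non-string value such as NaN, which word_tokenize cannot process. A minimal sketch of the fix, using a plain list to stand in for the more_clean column and str.split() as a stand-in tokenizer (so the example runs without the NLTK Punkt models), with the sample values being an assumption:

```python
# Sketch: coerce each value to str before tokenizing.
# str.split() stands in for nltk's word_tokenize here.
word_dict = {}
more_clean = ["good clean fun", float("nan"), "good fun"]  # NaN mimics an empty cell

for sentence in more_clean:
    word_token = str(sentence).split()  # str() avoids the TypeError on NaN
    for j in word_token:
        word_dict[j] = word_dict.get(j, 0) + 1

print(word_dict)  # {'good': 2, 'clean': 1, 'fun': 2, 'nan': 1}
```

Note that str() turns a missing value into the literal word 'nan', which then gets counted; if that is unwanted, drop the missing rows first (e.g. with df['more_clean'].dropna()) before tokenizing.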
See the nltk.tokenize.word_tokenize documentation:
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
Parameters:
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not