I use the word_tokenize function on my dataframe, calling the column name more_clean to build word_dict, but executing it raises the error "expected string or bytes-like object".

This is my dataframe:

And this is my code:
from nltk.tokenize import word_tokenize

word_dict = {}
for i in range(0, len(df['more_clean'])):
    sentence = df['more_clean'][i]
    word_token = word_tokenize(sentence)
    for j in word_token:
        if j not in word_dict:
            word_dict[j] = 1
        else:
            word_dict[j] += 1
and an error message appears like this:
TypeError: expected string or bytes-like object
You need to make sure the sentence variable is of type str:
word_token = word_tokenize(str(sentence))
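This error typically appears when the column contains a non-string value such as NaN, which word_tokenize cannot process. A minimal sketch of the fix, using a plain list to stand in for the more_clean column and str.split() as a stand-in tokenizer (so the example runs without the NLTK Punkt models), with the sample values being an assumption:

```python
# Sketch: coerce each value to str before tokenizing.
# str.split() stands in for nltk's word_tokenize here.
word_dict = {}
more_clean = ["good clean fun", float("nan"), "good fun"]  # NaN mimics an empty cell

for sentence in more_clean:
    word_token = str(sentence).split()  # str() avoids the TypeError on NaN
    for j in word_token:
        word_dict[j] = word_dict.get(j, 0) + 1

print(word_dict)  # {'good': 2, 'clean': 1, 'fun': 2, 'nan': 1}
```

Note that str() turns a missing value into the literal word 'nan', which then gets counted; if that is unwanted, drop the missing rows first (e.g. with df['more_clean'].dropna()) before tokenizing.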
See the nltk.tokenize.word_tokenize documentation:
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
Parameters:
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not