
nltk.word_tokenize returns nothing in (n,2) shaped large vector (dataframe)

I have a basic dataset with one object column named 'comment' and one float column named 'toxicity'. My dataset's shape is (1999516, 2).


I'm trying to add a new column named 'tokenized' using nltk's word tokenize method and create a bag of words like this:

import pandas as pd
import nltk

dataset = pd.read_csv('toxic_comment_classification_dataset.csv')

dataset['tokenized'] = dataset['comment'].apply(nltk.word_tokenize)

as mentioned in "IN [22]"

I don't get an error message until I wait about 5 minutes; after that I get this error:

TypeError: expected string or bytes-like object

How can I add tokenized comments to my vector (dataframe) as a new column?

It depends on the data in your comment column. It looks like not all of it is of string type. You can process only the string data and keep the other types as-is with:

dataset['tokenized'] = dataset['comment'].apply(lambda x: nltk.word_tokenize(x) if isinstance(x, str) else x)
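
A likely cause of the TypeError is NaN values, which pandas inserts for empty CSV cells and which word_tokenize cannot handle. As a quick check (a minimal sketch; it assumes the dataset from the question, and filling NaN with empty strings is just one option):

# Count rows whose 'comment' is not a string (typically NaN from empty CSV cells)
non_str = dataset[~dataset['comment'].apply(lambda x: isinstance(x, str))]
print(len(non_str), "non-string rows")

# Alternative: replace NaN with '' so every row can be tokenized
dataset['comment'] = dataset['comment'].fillna('')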

nltk.word_tokenize(x) is a resource-consuming function. If you need to parallelize your Pandas code, there are special libraries like Dask. See Make Pandas DataFrame apply() use all cores?
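
For example, a minimal sketch of the Dask approach (the npartitions value of 8 is an arbitrary choice, not from the question; tune it to your core count):

import dask.dataframe as dd
import nltk

# Split the dataframe into partitions that Dask can process in parallel
ddf = dd.from_pandas(dataset, npartitions=8)

# meta tells Dask the name/dtype of the resulting series
tokenized = ddf['comment'].apply(
    lambda x: nltk.word_tokenize(x) if isinstance(x, str) else x,
    meta=('comment', 'object'),
)
dataset['tokenized'] = tokenized.compute()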

