Tokenizing words in a pandas Series

Question

I'm having trouble tokenizing the words in a pandas Series.

My Series is named df:

                        text
0     This monitor is a great deal for the price.
1     I would recommend it.
2     poor packaging.
dtype: object

I tried df_tokenized=nltk.word_tokenize(df), but that raised TypeError: expected string or bytes-like object

I also tried three variants of .apply(lambda row:), each of which failed with the error shown beneath it:

df_tokenized=df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
> TypeError: <lambda>() got an unexpected keyword argument 'axis'

df_tokenized=df.apply(lambda row: nltk.word_tokenize(row['text']))
> TypeError: string indices must be integers

df_tokenized=df.apply(lambda row: nltk.word_tokenize(row[1]))
> TypeError: 'float' object is not subscriptable

Is there another way to tokenize the words in a Series?

Answer 1

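I believe the following works (it is the first variant you quoted), provided df is a DataFrame with a text column, as constructed here: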

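import nltk
import pandas as pd

# word_tokenize needs NLTK's Punkt tokenizer data, installed via nltk.download().

df = pd.DataFrame({'text': [' This monitor is a great deal for the price.',
                            'I would recommend it.',
                            'poor packaging.']})
df.info()  # info() prints its summary itself and returns None
print(df)

# axis=1 passes each row to the lambda, so row['text'] is that row's string
df_tokenized = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)

print(df_tokenized)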

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes

     

                                  text
0   This monitor is a great deal for the price.
1                         I would recommend it.
2                               poor packaging.
0    [This, monitor, is, a, great, deal, for, the, ...
1                         [I, would, recommend, it, .]
2                                 [poor, packaging, .]
dtype: object
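If the object really is a Series rather than a DataFrame (the dtype: object footer and the error about the unexpected axis argument both point that way), Series.apply passes each value to the function instead of a row, so the tokenizer can be applied to the values directly. The 'float' object is not subscriptable error also hints that the real data may contain missing values, so the sketch below guards against non-string entries. This is only a minimal sketch under those assumptions: simple_tokenize is a hypothetical regex stand-in so the example runs without NLTK's data files, tokenize_series is a helper name invented here, and the NaN entry is added purely for illustration.

```python
import re

import pandas as pd

# Hypothetical stand-in tokenizer: runs of word characters, or single
# punctuation marks. With NLTK and its tokenizer data installed,
# nltk.word_tokenize can be passed to tokenize_series instead.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split a string into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


def tokenize_series(series, tokenizer=simple_tokenize):
    """Tokenize each string in a Series; non-strings (e.g. NaN) become []."""
    return series.apply(
        lambda value: tokenizer(value) if isinstance(value, str) else []
    )


# Sample data from the question, plus one missing value for illustration.
text = pd.Series(
    [
        "This monitor is a great deal for the price.",
        "I would recommend it.",
        "poor packaging.",
        float("nan"),
    ],
    name="text",
)

tokens = tokenize_series(text)
print(tokens)
```

Calling tokenize_series(text, tokenizer=nltk.word_tokenize) should give NLTK's tokenization instead; the regex stand-in will differ from it on contractions and similar cases.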

Solution 1
0 Accepted 2020-10-28 15:02:03
