WordNetLemmatizer error on dask.dataframe: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'

Question

I am trying to lemmatize a text column of a dask dataframe:

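from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatizing(sentence):
    stemSentence = ""

    for word in sentence.split():
        stem = wnl.lemmatize(word)
        stemSentence += stem
        stemSentence += " "

    # Strip the trailing space once, after the loop has finished.
    stemSentence = stemSentence.strip()

    return stemSentence

# Apply lemmatizing() to every row of the column.
df['news_content'] = df['news_content'].apply(lemmatizing).compute()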

But I get the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'

I have already tried what is recommended here, without any luck.

Thanks for your help.

Answer 1

This happens because the wordnet module is loaded lazily and has not been evaluated yet.
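To illustrate why the message pairs the reader class with a loader attribute, here is a toy sketch of the lazy-loading pattern. It is not NLTK's code: ToyReader and ToyLazyLoader are hypothetical stand-ins, and the sketch assumes (as I understand NLTK's loader to work) that on first use the loader turns itself into the real reader by swapping its own `__dict__` and `__class__`. Once that swap has happened, the loader's name-mangled private attributes are gone, so any code still expecting them fails in the same way. With a threaded scheduler, two threads triggering the first load at about the same time could presumably leave one of them in that position.

```python
class ToyReader:
    """Hypothetical stand-in for the real corpus reader, much simplified."""

    def __init__(self, name):
        self.name = name

    def lemma(self, word):
        # Crude placeholder for real lemmatization.
        return word.rstrip("s")


class ToyLazyLoader:
    """Hypothetical stand-in for a lazy loader: builds the reader on first use."""

    def __init__(self, reader_cls, *args):
        # Name mangling stores these as _ToyLazyLoader__reader_cls / __args.
        self.__reader_cls = reader_cls
        self.__args = args

    def _load(self):
        reader = self.__reader_cls(*self.__args)
        # Become the reader: the loader's private attributes vanish
        # together with the old instance dictionary.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__

    def __getattr__(self, attr):
        # Only reached while the object is still a ToyLazyLoader.
        self._load()
        return getattr(self, attr)


corpus = ToyLazyLoader(ToyReader, "wordnet")
print(type(corpus).__name__)   # still the loader
print(corpus.lemma("cats"))    # first real use triggers the load
print(type(corpus).__name__)   # now the reader

message = None
try:
    # Anything that still expects the loader's private state now fails.
    corpus._ToyLazyLoader__args
except AttributeError as exc:
    message = str(exc)
print(message)
```

The last line prints an error of the same shape as the one in the question, with the toy class names in place of NLTK's.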

One trick to make it work is to call WordNetLemmatizer() once before using it on the Dask dataframe, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> import dask.dataframe as dd

>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0


>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('cats') # Use it once first, to "unlazify" wordnet.
'cat'

# Now you can use it with Dask dataframe's .apply() function.
>>> lemmatize_text = lambda sent: [wnl.lemmatize(word) for word in sent.split()]

>>> df['lemmas'] = df['text'].apply(lemmatize_text)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, is, a, sentence]
1  that is a foo bar thing      0  [that, is, a, foo, bar, thing]
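A slightly more explicit variant of the same trick is sketched below. It assumes that calling `ensure_loaded()` on `nltk.corpus.wordnet` forces the load up front, which has the same effect as the throwaway `lemmatize('cats')` call. The helper names `warm_up_wordnet`, `lemmatize_sentence` and `lemmatize_column` are made up for this example, and the `meta` argument is only there to tell Dask the output dtype instead of letting it guess.

```python
from functools import partial


def warm_up_wordnet():
    """Force NLTK to load WordNet now, before any Dask task runs."""
    # Imported inside the function so the pure helper below has no dependencies.
    from nltk.corpus import wordnet
    wordnet.ensure_loaded()


def lemmatize_sentence(sentence, lemmatize):
    """Lemmatize each whitespace-separated token and re-join with single spaces."""
    return " ".join(lemmatize(word) for word in sentence.split())


def lemmatize_column(ddf, column):
    """Return a lazy Dask Series with `column` lemmatized; nothing is computed here."""
    from nltk.stem import WordNetLemmatizer

    warm_up_wordnet()
    wnl = WordNetLemmatizer()
    return ddf[column].apply(
        partial(lemmatize_sentence, lemmatize=wnl.lemmatize),
        meta=(column, "object"),  # output is a column of Python strings
    )
```

With the dataframe from the question this would be used as `df['news_content'] = lemmatize_column(df, 'news_content')`, followed by `df.compute()` only when the result is actually needed.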

Alternatively, you can try pywsd:

pip install -U pywsd

Then, in your code:

>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.131901025772095 secs.

>>> import dask.dataframe as dd

>>> df = dd.read_csv('something.csv')
>>> df.head()
                      text  label
0       this is a sentence      1
1  that is a foo bar thing      0

>>> df['lemmas'] = df['text'].apply(lemmatize_sentence)
>>> df.head()
                      text  label                          lemmas
0       this is a sentence      1         [this, be, a, sentence]
1  that is a foo bar thing      0  [that, be, a, foo, bar, thing]
