
dask - AttributeError: 'Series' object has no attribute 'split'

I have over 8 million rows of text from which I want to remove all stop words and also lemmatize the text using dask.map_partitions(), but I get the following error:

AttributeError: 'Series' object has no attribute 'split'

Is there any way to apply the function to the dataset?

Thanks for the help.

import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

cachedStopWords = list(stop_words.STOP_WORDS)

def stopwords_lemmatizing(text):
    return [word for word in text.split() if word not in cachedStopWords]

text = 'any length of text'
data = [{'content': text}]
df = pd.DataFrame(data, index=[0])
ddf = dd.from_pandas(df, npartitions=1)

# Raises the AttributeError when computed: map_partitions passes each whole
# partition (a pandas Series) to the function, and a Series has no .split().
ddf['content'] = ddf['content'].map_partitions(stopwords_lemmatizing, meta='f8')

map_partitions, as the name suggests, works on each partition of your overall dask dataframe, each of which is a pandas dataframe (http://docs.dask.org/en/latest/dataframe.html#design). Your function works value-by-value on a series, so what you actually wanted was the simple map:

ddf['content'] = ddf['content'].map(stopwords_lemmatizing)
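If you'd rather keep map_partitions, that also works, so long as the function it receives can handle a whole pandas Series; a minimal sketch (reusing ddf, pd and stopwords_lemmatizing from the question) that maps element-wise inside each partition:

ddf['content'] = ddf['content'].map_partitions(
    lambda part: part.map(stopwords_lemmatizing),  # part is a pandas Series
    meta=pd.Series(dtype='object'))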

(If you want to provide the meta to map as well, it should be a zero-length Series rather than a DataFrame, e.g. meta=pd.Series(dtype='O').)
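Putting it together, a minimal end-to-end sketch of the corrected version; using a set for the stop words is my own tweak (faster membership tests), the original list works too:

import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

cachedStopWords = set(stop_words.STOP_WORDS)  # set membership is O(1)

def stopwords_lemmatizing(text):
    # Stop-word removal only; a lemmatizer would slot in here.
    return [word for word in text.split() if word not in cachedStopWords]

df = pd.DataFrame([{'content': 'any length of text'}])
ddf = dd.from_pandas(df, npartitions=1)

# map applies the function element-wise; meta describes the output column
ddf['content'] = ddf['content'].map(stopwords_lemmatizing,
                                    meta=pd.Series(dtype='object'))

print(ddf.compute())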
