
dask - AttributeError: 'Series' object has no attribute 'split'

I have over 8 million rows of text from which I want to remove all stop words and also lemmatize the text using dask.map_partitions(), but I get the following error:

AttributeError: 'Series' object has no attribute 'split'

Is there any way to apply the function to the dataset?

Thanks for the help.

import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

cachedStopWords = list(stop_words.STOP_WORDS)

def stopwords_lemmatizing(text):
    return [word for word in text.split() if word not in cachedStopWords]

text = 'any length of text'
data = [{'content': text}]
df = pd.DataFrame(data, index=[0])
ddf = dd.from_pandas(df, npartitions=1)

# Raises the AttributeError when computed: map_partitions passes each whole
# partition (a pandas Series) to the function, and a Series has no .split().
ddf['content'] = ddf['content'].map_partitions(stopwords_lemmatizing, meta='f8')

map_partitions, as the name suggests, works on each partition of your overall dask dataframe, each of which is a pandas dataframe (http://docs.dask.org/en/latest/dataframe.html#design). Your function works value-by-value on a series, so what you actually wanted was the simple map:

ddf['content'] = ddf['content'].map(stopwords_lemmatizing)
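If you'd rather keep map_partitions, that also works, so long as the function it receives can handle a whole pandas Series; a minimal sketch (reusing ddf, pd and stopwords_lemmatizing from the question) that maps element-wise inside each partition:

ddf['content'] = ddf['content'].map_partitions(
    lambda part: part.map(stopwords_lemmatizing),  # part is a pandas Series
    meta=pd.Series(dtype='object'))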

(If you want to provide the meta to map as well, it should be a zero-length Series rather than a DataFrame, e.g. meta=pd.Series(dtype='O').)
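Putting it together, a minimal end-to-end sketch of the corrected version; using a set for the stop words is my own tweak (faster membership tests), the original list works too:

import pandas as pd
import dask.dataframe as dd
from spacy.lang.en import stop_words

cachedStopWords = set(stop_words.STOP_WORDS)  # set membership is O(1)

def stopwords_lemmatizing(text):
    # Stop-word removal only; a lemmatizer would slot in here.
    return [word for word in text.split() if word not in cachedStopWords]

df = pd.DataFrame([{'content': 'any length of text'}])
ddf = dd.from_pandas(df, npartitions=1)

# map applies the function element-wise; meta describes the output column
ddf['content'] = ddf['content'].map(stopwords_lemmatizing,
                                    meta=pd.Series(dtype='object'))

print(ddf.compute())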
