How do I return multiple values from a function applied on a Dask Series?
I am trying to return a Series from each iteration of dask.Series.apply and have the final result be a dask.DataFrame.
The following code tells me that the meta is wrong. The all-pandas version however works. What's wrong here?
Update: I think that I am not specifying the meta/schema correctly. How do I do it correctly? Now it works when I drop the meta argument. However, it raises a warning. I would like to use dask "correctly".
import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
def transformMyCol(x):
    # Minimal example function
    return pd.Series(['Tom - ' + str(x), 'Deskflip - ' + str(x / 8), ''])
#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol,1)
#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked = df.target.apply(transformMyCol)
unpacked.head()
#
## Dask Version (first attempt) - Raises an exception
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked_dask_schema = {"name" : str, "action" : str, "comments" : str}
unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()
This is the error that I get:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
I have also tried the following, and it does not work either.
meta_df = pd.DataFrame(dtype='str',columns=list(unpacked_dask_schema.keys()))
unpacked = df.target.apply(transformMyCol, meta=meta_df)
unpacked.head()
Same error:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
You're right, the problem is that you're not specifying the meta correctly. More specifically, as the error message says, the metadata columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2). You should either change the metadata columns to match the computed ones:
unpacked_dask_schema = dict.fromkeys(range(3), str)
df.target.apply(transformMyCol, meta=unpacked_dask_schema)
or change transformMyCol to use the named columns:
def transformMyCol(x):
    # The dict keys become the column names of the resulting DataFrame
    return pd.Series({
        'name': 'Tom - ' + str(x),
        'action': 'Deskflip - ' + str(x / 8),
        'comments': '',
    })