
Unpack tuple inside function when using Dask map partitions

I'm trying to run a function over many partitions of a Dask dataframe. The code unpacks tuples and works well with Pandas, but fails with Dask's map_partitions . The data consists of lists of tuples; the lists can vary in length, but the tuples always have a known, fixed length.

import dask.dataframe as dd
import pandas as pd

def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
            # Do more stuff

# Create Pandas dataframe
# Each list may have a different length, tuples have fixed known length
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
# Pandas to Dask
ddf = dd.from_pandas(df, npartitions=2)

# Run function over Pandas dataframe
func(df)
# Run function over Dask dataframe
ddf.map_partitions(func).compute()

Here, the Pandas version runs with no issues. The Dask version, however, raises the error:

ValueError: Metadata inference failed in `func`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
ValueError('not enough values to unpack (expected 2, got 1)')

In my original function, I use these tuples only as auxiliary variables; the data that is finally returned is completely different, so using meta doesn't fix the problem. How can I unpack the tuples?

When you use map_partitions without specifying meta , Dask will run the function on a small sample dataframe to infer the type of the output. This can cause problems if your function is not compatible with that sample dataframe; you can inspect it with ddf._meta_nonempty (in this case it contains a column of foo placeholder strings).
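For illustration, here is a minimal sketch of inspecting that sample dataframe (assuming the ddf from the question; the exact placeholder values are an internal detail of Dask):

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)

# The sample dataframe Dask uses for metadata inference; for an object
# column it typically holds the placeholder string 'foo', which is why
# `x, y = t` fails during inference.
print(ddf._meta_nonempty)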

An easy fix in this case is to provide meta . It's fine for the returned data to have a different format from the input; for example, if each partition's result is a list, you can provide meta=list :

import dask.dataframe as dd
import pandas as pd

def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
    return [1, 2, 3]

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# Providing meta skips inference, so func never runs on the 'foo' sample
ddf.map_partitions(func, meta=list).compute()
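If your function instead returns something pandas-like, meta can describe the result precisely. As a sketch under assumptions (func_series and the n_tuples name are illustrative, not from the question), a per-row count returned as a pandas Series can be declared with a (name, dtype) tuple:

import dask.dataframe as dd
import pandas as pd

def func_series(df):
    # Count the tuples in each row's list; the tuples themselves could be
    # unpacked here just as in the original function.
    return df['A'].map(len).rename('n_tuples')

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)

# A (name, dtype) tuple tells Dask each partition returns a Series
ddf.map_partitions(func_series, meta=('n_tuples', 'i8')).compute()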

Another approach is to make your function compatible with the sample dataframe. The sample's object column contains the string foo rather than a list of tuples, so iterating over it yields single characters that cannot be unpacked into two values. Changing the unpacking to x, *y = t makes the function tolerate this:

import dask.dataframe as dd
import pandas as pd

def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            # Star-unpacking also accepts the single characters of 'foo'
            x, *y = t
    return [1, 2, 3]

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# Note that no meta is specified here; inference now succeeds
ddf.map_partitions(func).compute()
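A third option (a hedged variant, not part of the answer above) is to keep strict two-element unpacking but skip anything that is not a length-2 tuple, so the foo sample passes through harmlessly while real data is still unpacked as before:

import dask.dataframe as dd
import pandas as pd

def func_guarded(df):
    for index, row in df.iterrows():
        for t in row['A']:
            # Skip non-tuples such as the characters of the 'foo' placeholder
            # used during metadata inference; real rows are unaffected.
            if not (isinstance(t, tuple) and len(t) == 2):
                continue
            x, y = t
    return [1, 2, 3]

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(func_guarded).compute()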
