
`.iloc` returns strange results when used with a dask dataframe groupby

I have a large dataset with 3 columns:

      sku  center  units
0  103896       1    2.0
1  103896       1    0.0
2  103896       1    5.0
3  103896       1    0.0
4  103896       1    7.0
5  103896       1    0.0

I need to use a groupby-apply with the following two functions:

import numpy as np

def function_a(x):
    # count the trailing zeros: reverse the series, then count the leading non-positive values
    return np.sum((x > 0).iloc[::-1].cumsum() == 0)

def function_b(x):
    # total number of zeros divided by the number of zero runs (average zero-run length)
    return x.eq(0).sum() / ((x.eq(0) & x.shift().ne(0)).sum())
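Applied to the sample units column above (a quick check in plain pandas, with the sample values typed in by hand), these give:

import pandas as pd

units = pd.Series([2.0, 0.0, 5.0, 0.0, 7.0, 0.0])

print(function_a(units))   # 1   -> one zero at the end of the series
print(function_b(units))   # 1.0 -> three zeros spread over three separate zero runs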

Using dask ( df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float)) ), I run into problems applying the first function, because dask does not support positional indexing ( .iloc ) and the results are completely wrong.

Is it possible to apply those functions using a PySpark UDF?
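For the PySpark part: a minimal sketch of how this could look, assuming PySpark 3.x with pyarrow installed, would pass the per-group logic to groupBy().applyInPandas; the 'id' ordering column and the output schema below are illustrative assumptions, not tested against the real data.

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# same CSV layout as in the dask answer below: id, sku, center, units
sdf = (spark.read.csv('path/to/data_*.csv', inferSchema=True)
            .toDF('id', 'sku', 'center', 'units'))

def trailing_zero_count(pdf: pd.DataFrame) -> pd.DataFrame:
    # each group arrives as an ordinary pandas DataFrame, so .iloc works here;
    # sort by 'id' first, because Spark gives no ordering guarantee within a group
    units = pdf.sort_values('id')['units']
    value = float(np.sum((units > 0).iloc[::-1].cumsum() == 0))
    return pd.DataFrame({'sku': [int(pdf['sku'].iloc[0])],
                         'center': [int(pdf['center'].iloc[0])],
                         'function_a': [value]})

result = (sdf.groupBy('sku', 'center')
             .applyInPandas(trailing_zero_count,
                            schema='sku bigint, center bigint, function_a double'))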

Assumptions

Your index (in the above example 0, 1, 2, 3, 4, 5) corresponds to the correct ordering that you want, e.g. because the data comes from CSV files of the form

0,103896,1,2.0
1,103896,1,0.0
2,103896,1,5.0

where the first column corresponds to the sample number. When you then read the data with:

import dask.dataframe as dd
df = dd.read_csv('path/to/data_*.csv', header=None)
df.columns = ['id', 'sku', 'center', 'units']
df = df.set_index('id')

this gives you a deterministic DataFrame, meaning the index of the data is the same no matter in what order the files are read from disk.
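If you want to confirm this on the df built above: set_index sorts the data by 'id' and computes the partition boundaries, which you can inspect.

# after set_index('id') the row order seen inside each group is reproducible
print(df.known_divisions)   # True
print(df.divisions)         # tuple of 'id' values at the partition boundaries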

Solution to the .iloc problem

You can then change function_a to:

def function_a(x):
    # reverse by sorting the index in descending order instead of positional .iloc[::-1]
    return np.sum((x.sort_index(ascending=False) > 0).cumsum() == 0)

which should now work with

df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))
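As a quick sanity check in plain pandas, using the sample units column from the question: reversing via sort_index gives the same count as the original positional .iloc[::-1], provided the index is already in ascending order.

import numpy as np
import pandas as pd

units = pd.Series([2.0, 0.0, 5.0, 0.0, 7.0, 0.0])

a_iloc = np.sum((units > 0).iloc[::-1].cumsum() == 0)                      # original version
a_sorted = np.sum((units.sort_index(ascending=False) > 0).cumsum() == 0)   # rewritten version
print(a_iloc, a_sorted)   # 1 1 -> one trailing zero in the sample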
