
`.iloc` returns strange results when used with a dask dataframe groupby

I have a large dataset with 3 columns:

      sku  center  units
0  103896       1    2.0
1  103896       1    0.0
2  103896       1    5.0
3  103896       1    0.0
4  103896       1    7.0
5  103896       1    0.0

I need to use a groupby-apply with the following two functions:

import numpy as np

def function_a(x):
    # count the trailing zeros: reverse the series, then count the leading non-positive values
    return np.sum((x > 0).iloc[::-1].cumsum() == 0)

def function_b(x):
    # total number of zeros divided by the number of zero runs (average zero-run length)
    return x.eq(0).sum() / ((x.eq(0) & x.shift().ne(0)).sum())
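Applied to the sample units column above (a quick check in plain pandas, with the sample values typed in by hand), these give:

import pandas as pd

units = pd.Series([2.0, 0.0, 5.0, 0.0, 7.0, 0.0])

print(function_a(units))   # 1   -> one zero at the end of the series
print(function_b(units))   # 1.0 -> three zeros spread over three separate zero runs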

Using dask ( df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float)) ), I run into problems applying the first function, because dask does not support positional indexing ( .iloc ) and the results are completely wrong.

Is it possible to apply those functions using a PySpark UDF?
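For the PySpark part: a minimal sketch of how this could look, assuming PySpark 3.x with pyarrow installed, would pass the per-group logic to groupBy().applyInPandas; the 'id' ordering column and the output schema below are illustrative assumptions, not tested against the real data.

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# same CSV layout as in the dask answer below: id, sku, center, units
sdf = (spark.read.csv('path/to/data_*.csv', inferSchema=True)
            .toDF('id', 'sku', 'center', 'units'))

def trailing_zero_count(pdf: pd.DataFrame) -> pd.DataFrame:
    # each group arrives as an ordinary pandas DataFrame, so .iloc works here;
    # sort by 'id' first, because Spark gives no ordering guarantee within a group
    units = pdf.sort_values('id')['units']
    value = float(np.sum((units > 0).iloc[::-1].cumsum() == 0))
    return pd.DataFrame({'sku': [int(pdf['sku'].iloc[0])],
                         'center': [int(pdf['center'].iloc[0])],
                         'function_a': [value]})

result = (sdf.groupBy('sku', 'center')
             .applyInPandas(trailing_zero_count,
                            schema='sku bigint, center bigint, function_a double'))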

Assumptions

Your index (in the above example 0, 1, 2, 3, 4, 5) corresponds to the correct ordering that you want, e.g. because the data comes from CSV files of the form

0,103896,1,2.0
1,103896,1,0.0
2,103896,1,5.0

where the first column corresponds to the sample number. When you then read the data with:

import dask.dataframe as dd
df = dd.read_csv('path/to/data_*.csv', header=None)
df.columns = ['id', 'sku', 'center', 'units']
df = df.set_index('id')

this gives you a deterministic DataFrame, meaning the index of the data is the same no matter in what order the files are read from disk.
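If you want to confirm this on the df built above: set_index sorts the data by 'id' and computes the partition boundaries, which you can inspect.

# after set_index('id') the row order seen inside each group is reproducible
print(df.known_divisions)   # True
print(df.divisions)         # tuple of 'id' values at the partition boundaries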

Solution to the .iloc problem

You can then change function_a to:

def function_a(x):
    # reverse by sorting the index in descending order instead of positional .iloc[::-1]
    return np.sum((x.sort_index(ascending=False) > 0).cumsum() == 0)

which should now work with

df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))
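As a quick sanity check in plain pandas, using the sample units column from the question: reversing via sort_index gives the same count as the original positional .iloc[::-1], provided the index is already in ascending order.

import numpy as np
import pandas as pd

units = pd.Series([2.0, 0.0, 5.0, 0.0, 7.0, 0.0])

a_iloc = np.sum((units > 0).iloc[::-1].cumsum() == 0)                      # original version
a_sorted = np.sum((units.sort_index(ascending=False) > 0).cumsum() == 0)   # rewritten version
print(a_iloc, a_sorted)   # 1 1 -> one trailing zero in the sample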
