
Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:

pprint(daskdf.head())
   grid     lat      lon  ...  2014-12-29  2014-12-30  2014-12-31
0     0  48.125 -124.625  ...         0.0         0.0  -17.034216
1     0  48.625 -124.625  ...         0.0         0.0  -19.904214
4     0  42.375 -124.375  ...         0.0         0.0   -8.380443
5     0  42.625 -124.375  ...         0.0         0.0   -8.796803
6     0  42.875 -124.375  ...         0.0         0.0   -7.683688

I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:

pddf[pddf==500].count().sum()

I'm aware that not all pandas functions/syntax translate directly to dask, but how would I do this with a dask dataframe? I tried doing:

daskdf[daskdf==500].count().sum().compute()

but this yielded a NotImplementedError.

As in many cases where a pandas method is not yet explicitly implemented in dask, you can use map_partitions to apply it to each partition. In this case, that might look like:

daskdf.map_partitions(lambda df: df[df==500].count()).sum().compute()

You can experiment with whether also doing a .sum() within the lambda helps (it would produce smaller intermediates) and what the meta= argument to map_partitions should look like; see the sketch below.
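For example, a minimal sketch of pushing the sum into the lambda and supplying meta= might look like the following (the data here is synthetic and stands in for daskdf; assuming all columns are numeric):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical data standing in for the real daskdf.
pddf = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 20)))
daskdf = dd.from_pandas(pddf, npartitions=4)

# Summing inside the lambda reduces each partition to a single integer,
# so only one scalar per partition flows through the task graph.
# meta=(None, "int64") declares that each partition yields one int64 value.
per_partition = daskdf.map_partitions(
    lambda df: (df == 500).sum().sum(),
    meta=(None, "int64"),
)
total = per_partition.sum().compute()

# Cross-check against the plain pandas computation from the question.
assert total == pddf[pddf == 500].count().sum()
print(total)

Note that (df == 500).sum() counts matches per column, which is equivalent to df[df == 500].count() but avoids materializing the masked frame.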
