
Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:

pprint(daskdf.head())
   grid     lat      lon  ...  2014-12-29  2014-12-30  2014-12-31
0     0  48.125 -124.625  ...         0.0         0.0  -17.034216
1     0  48.625 -124.625  ...         0.0         0.0  -19.904214
4     0  42.375 -124.375  ...         0.0         0.0   -8.380443
5     0  42.625 -124.375  ...         0.0         0.0   -8.796803
6     0  42.875 -124.375  ...         0.0         0.0   -7.683688

I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:

pddf[pddf==500].count().sum()

I'm aware that not all pandas functions/syntax translate directly to dask, but how would I do this with a dask dataframe? I tried doing:

daskdf[daskdf==500].count().sum().compute()

but this yielded a NotImplementedError.

As in many cases where a pandas method is not yet explicitly implemented in dask, you can use map_partitions to apply it to each partition. In this case, that might look like:

daskdf.map_partitions(lambda df: df[df==500].count()).sum().compute()

You can experiment with whether also doing a .sum() within the lambda helps (it would produce smaller intermediates) and what the meta= argument to map_partitions should look like; see the sketch below.
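For example, a minimal sketch of pushing the sum into the lambda and supplying meta= might look like the following (the data here is synthetic and stands in for daskdf; assuming all columns are numeric):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical data standing in for the real daskdf.
pddf = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 20)))
daskdf = dd.from_pandas(pddf, npartitions=4)

# Summing inside the lambda reduces each partition to a single integer,
# so only one scalar per partition flows through the task graph.
# meta=(None, "int64") declares that each partition yields one int64 value.
per_partition = daskdf.map_partitions(
    lambda df: (df == 500).sum().sum(),
    meta=(None, "int64"),
)
total = per_partition.sum().compute()

# Cross-check against the plain pandas computation from the question.
assert total == pddf[pddf == 500].count().sum()
print(total)

Note that (df == 500).sum() counts matches per column, which is equivalent to df[df == 500].count() but avoids materializing the masked frame.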
