I have a dask dataframe with thousands of columns and rows as follows:
pprint(daskdf.head())
grid lat lon ... 2014-12-29 2014-12-30 2014-12-31
0 0 48.125 -124.625 ... 0.0 0.0 -17.034216
1 0 48.625 -124.625 ... 0.0 0.0 -19.904214
4 0 42.375 -124.375 ... 0.0 0.0 -8.380443
5 0 42.625 -124.375 ... 0.0 0.0 -8.796803
6 0 42.875 -124.375 ... 0.0 0.0 -7.683688
I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:
pddf[pddf==500].count().sum()
I'm aware that you can't translate all pandas functions/syntax with dask, but how would I do this with a dask dataframe? I tried doing:
daskdf[daskdf==500].count().sum().compute()
but this yielded a "Not Implemented" error.
As in many cases, where there is a row-wise pandas method which is not explicitly implemented yet in dask, you can use map_partitions
. In this case this might look like:
ppdf.map_partitions(lambda df: df[df==500].count()).sum().compute()
You can experiment with whether also doing a .sum()
within the lambda helps (it would produce smaller intermediaries) and what the meta=
argument to map_partition
should look like.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.