Can I define data filters with Intake catalogs?

I would like to use Intake not only to link to published datasets, but also to filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but that means giving the user code in addition to the metadata in order to provide some guidance.

Motivation: often the user is not as familiar with the dataset as the producer is, and it would be nice to do some preprocessing for them without asking them to add a series of filtering steps in Python.

For example, if we have already opened a CSV, we can filter with df[df['rain'] > 70], but I don't see any argument in read_csv, for either pandas or dask, that does this.
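For reference, the post-load filter described above looks like this in plain pandas. This is a minimal sketch: the inline CSV text and the station column are made up for illustration, and only the rain column and the threshold of 70 come from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the published CSV.
CSV_TEXT = """station,rain
A,65
B,72
C,90
"""

# read_csv has no row-filter argument, so the whole file is loaded first...
df = pd.read_csv(io.StringIO(CSV_TEXT))

# ...and the filter is applied afterwards with a boolean mask.
wet = df[df["rain"] > 70]
print(wet)
```

The same boolean-mask expression also works on a dask dataframe, which is why the filter has to live somewhere outside the read_csv call.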

There is, indeed, no way to pass a filter to pandas' or dask's read_csv functions, and therefore this is not an option supported by Intake's CSV driver.

However, Intake does support dataset transforms: https://intake.readthedocs.io/en/latest/transforms.html This means that you can operate on the output of one data source and assign a new catalogue entry to the result. The transform is computed on every access; the filtered dataset is not stored anywhere (unless you also use the persist functionality).
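As a rough sketch of how such a transform could be wired up, the function below is the kind of callable a derived catalog entry would point to. It assumes the derived-source drivers described on the linked transforms page (intake.source.derived.DataFrameTransform with targets, transform and transform_kwargs arguments). The module name my_filters, the file rain.csv, and the entry names are hypothetical, so check the argument names against the Intake version you have installed.

```python
import pandas as pd


def filter_rainy(df: pd.DataFrame, threshold: float = 70) -> pd.DataFrame:
    """Keep only rows whose 'rain' value exceeds the threshold.

    Written with plain boolean masking so it should also accept a dask
    dataframe, which is what a derived source may hand to the transform.
    """
    return df[df["rain"] > threshold]


# Assumed catalog layout (hypothetical names; driver arguments follow the
# Intake transforms documentation and may differ between versions):
#
# sources:
#   rain_raw:
#     driver: csv
#     args:
#       urlpath: "rain.csv"
#   rain_heavy:
#     driver: intake.source.derived.DataFrameTransform
#     args:
#       targets:
#         - rain_raw
#       transform: "my_filters.filter_rainy"
#       transform_kwargs:
#         threshold: 70
```

With a catalog like this, a user would open the rain_heavy entry and receive the filtered data directly, while the function itself has to be importable (here as my_filters.filter_rainy) in the user's environment.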
