I have a use case where:
I always need to apply a pre-processing step to the data before I can use it. (Because the naming etc. doesn't follow the community conventions enforced by software further down the processing chain.)
I cannot change the raw data. (Because it might be in a repo I don't control, or because it's too big to duplicate, ...)
If I aim to provide users with the easiest and most transparent way of obtaining the data in pre-processed form, I can see two ways of doing this:
Option 1: Read the raw data and apply the pre-processing step manually:
import intake
from my_tools import pre_process
cat = intake.open_catalog('...')
raw_df = cat.some_data.read()
df = pre_process(raw_df)
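For concreteness, here is a minimal sketch of what such a `pre_process` function might look like. The column names and the unit conversion are purely hypothetical stand-ins for whatever conventions the downstream software enforces; the original post does not specify them.

```python
import pandas as pd

# Hypothetical mapping from raw column names to community-convention names.
RENAME = {"temp": "air_temperature", "lon": "longitude", "lat": "latitude"}

def pre_process(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and fix units so the data matches downstream conventions."""
    df = raw_df.rename(columns=RENAME)
    # Assumed unit convention: temperatures in Kelvin rather than Celsius.
    if "air_temperature" in df:
        df["air_temperature"] = df["air_temperature"] + 273.15
    return df

# Stand-in for the raw data a catalog entry would return.
raw = pd.DataFrame({"temp": [20.0], "lon": [10.0], "lat": [50.0]})
df = pre_process(raw)
```

The raw data is never modified on disk, which satisfies the second constraint above.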
Option 2: Have the catalog apply the pre-processing step automatically as part of the `.read()` call. Catalog:
sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process
And the user code:
import intake
cat = intake.open_catalog('...')
df = cat.some_data.read()
Option 2 is not possible in Intake right now; Intake was designed for "load" rather than "process", so we've avoided the pipeline idea for now, but we might come back to it in the future.
However, you have a couple of options within Intake that you could consider alongside Option 1 above:
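In the meantime, one catalog-agnostic workaround (a sketch of my own, not an Intake feature) is to hide the two-step read-then-preprocess dance behind a single loader function that you ship alongside the catalog. The names below are hypothetical:

```python
from typing import Any, Callable

def make_loader(read: Callable[[], Any],
                pre_process: Callable[[Any], Any]) -> Callable[[], Any]:
    """Return a zero-argument loader that reads raw data, then pre-processes it."""
    def load() -> Any:
        return pre_process(read())
    return load

# In a real project, `read` would be `cat.some_data.read` from an Intake
# catalog; here a simple stand-in demonstrates the wiring.
load_some_data = make_loader(read=lambda: [3, 1, 2], pre_process=sorted)
result = load_some_data()
```

Users then call `load_some_data()` and never see the raw, unconventional form of the data.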