
How to do always-necessary pre-processing / cleaning with Intake?

I have a use case where:

  • I always need to apply a pre-processing step to the data before I can use it. (Because the naming etc. doesn't follow the community conventions enforced by software further down the processing chain.)

  • I cannot change the raw data. (Because it might be in a repository I don't control, or because it is too big to duplicate, ...)

If I want to give a user the easiest and most transparent way of obtaining the pre-processed data, I can see two ways of doing this:

1. Load unprocessed data with intake and apply the pre-processing immediately:

import intake
from my_tools import pre_process

cat = intake.open_catalog('...')
raw_df = cat.some_data.read()
df = pre_process(raw_df)

2. Apply the pre-processing step as part of the .read() call.

Catalog:

sources:
  some_data:
    args:
      urlpath: "/path/to/some_raw_data.csv"
    description: "Some data (already preprocessed)"
    driver: csv
    preprocess: my_tools.pre_process

And:

import intake

cat = intake.open_catalog('...')
df = cat.some_data.read()
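For concreteness, the pre_process function imported from my_tools might look like the following sketch; the specific column renames are made up for illustration and not part of Intake:

```python
# Hypothetical body for my_tools.pre_process; the rename mapping is a
# made-up example of adapting raw names to community conventions.
import pandas as pd


def pre_process(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with columns renamed to downstream conventions."""
    rename_map = {
        "TEMP": "temperature",   # assumed raw name -> convention name
        "LON": "longitude",
        "LAT": "latitude",
    }
    return df.rename(columns=rename_map)
```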

Option 2. is not possible in Intake right now; Intake was designed for "load" rather than "process", so we have avoided the pipeline idea so far, but we may come back to it in the future.

However, you have a couple of options within Intake that you could consider alongside Option 1., above:

  • Make your own driver, which implements the load and any processing exactly how you like. Writing drivers is pretty easy and can involve arbitrary code/complexity.
  • Write an alias-type driver, which takes the output of another entry in the same catalog and does something to it. See the docs and code for pointers.
