
In Foundry Code Repositories, how do I iterate over all datasets in a directory?

I'm trying to read all (or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path, without hardcoding individual datasets as inputs?

I'd like to dynamically fetch different columns from multiple datasets without having to hardcode individual input datasets.

Dynamically discovering inputs at build time doesn't work, because you would get inconsistent results every time you run CI. It would also break TLLV (transforms level logic versioning) by making it impossible to tell when the logic has actually changed, and therefore when a dataset should be marked stale.

You will have to write out the logical path of each dataset you wish to transform, even if those paths are just passed into a generated transform. There must be at least some consistent record of which datasets were targeted by which commit.
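Concretely, this usually means listing the paths in code and generating one transform per entry. Here is a minimal sketch of that transform-generation pattern; the dataset names, paths, and the added `source_dataset` column are hypothetical:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output

# Hypothetical list of dataset names. Every path still appears
# explicitly in the repository, so there is a consistent record
# of which datasets each commit targets.
SOURCES = ["dataset_a", "dataset_b", "dataset_c"]


def transform_generator(sources):
    transforms = []
    for source in sources:
        @transform_df(
            Output(f"/Project/output/{source}_cleaned"),
            source_df=Input(f"/Project/input/{source}"),
        )
        def compute(source_df, source=source):  # default arg binds the loop variable
            # Tag each row with the dataset it came from.
            return source_df.withColumn("source_dataset", F.lit(source))
        transforms.append(compute)
    return transforms


TRANSFORMS = transform_generator(SOURCES)
```

The generated transforms would then be registered on the repository's pipeline (e.g. `my_pipeline.add_transforms(*TRANSFORMS)` in `pipeline.py`), so each one shows up as an ordinary build target.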

Another tactic for achieving what you're looking for is to build a single long dataset that is the unpivoted version of all the source datasets. That way you can simply append new rows / files to this one dataset, which lets you accept arbitrary inputs, assuming your transform is written to handle them (see the sketch below). My rule of thumb is this: if you need dynamic schemas or a dynamic number of datasets, then you're better off using a dynamic set of files / rows inside a single dataset.
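A rough sketch of what such a transform might look like, assuming the appended files are CSVs and assuming hypothetical input/output paths: it lists the files inside one input dataset and unpivots each into a fixed `(source_file, column, value)` schema, so new files with new schemas need no code change.

```python
import functools

from pyspark.sql import DataFrame, functions as F
from transforms.api import transform, Input, Output


@transform(
    output=Output("/Project/output/long_table"),      # hypothetical path
    combined=Input("/Project/input/combined_files"),  # hypothetical path
)
def compute(ctx, output, combined):
    fs = combined.filesystem()
    long_parts = []
    # Each appended CSV becomes a set of (source_file, column, value) rows,
    # collapsing arbitrary per-file schemas into one fixed, long schema.
    # Assumes the dataset contains at least one matching file.
    for status in fs.ls(glob="*.csv"):
        df = (
            ctx.spark_session.read
            .option("header", "true")
            .csv(fs.hadoop_path + "/" + status.path)
        )
        kv = F.explode(
            F.array(*[
                F.struct(
                    F.lit(c).alias("column"),
                    F.col(c).cast("string").alias("value"),
                )
                for c in df.columns
            ])
        ).alias("kv")
        long_parts.append(
            df.select(F.lit(status.path).alias("source_file"), kv)
              .select("source_file", "kv.column", "kv.value")
        )
    output.write_dataframe(functools.reduce(DataFrame.unionByName, long_parts))
```

Casting every value to string is the price of accepting arbitrary schemas; downstream consumers re-cast the columns they care about.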
