In Foundry Code Repositories, how do I iterate over all datasets in a directory?
I'm trying to read all (or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path, without hardcoding individual datasets as input?
I'd like to dynamically fetch different columns from multiple datasets without having to hardcode individual input datasets.
So this doesn't work, since you would get inconsistent results every time CI runs. It would also break TLLV (transforms-level logic versioning) by making it impossible to tell when the logic has actually changed, and thus when a dataset should be marked stale.
You will have to write out the logical paths of each dataset you wish to transform, even if it means they are passed into a generated transform. There needs to be at least some consistent record of which datasets were targeted by which commit.
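A minimal sketch of that "generated transform" pattern: the dataset paths stay written out explicitly in the repository (so each commit carries a consistent record of its targets), but the transform bodies are produced in a loop. The paths below are hypothetical, and the Foundry-specific decorator (`transform_df` with `Input`/`Output` from `transforms.api`) is only noted in the docstring so the shape of the pattern is runnable plain Python:

```python
# Explicit, committed list of inputs -- this is what keeps TLLV happy.
# These paths are made-up examples.
DATASET_PATHS = [
    "/Project/clean/dataset_a",
    "/Project/clean/dataset_b",
    "/Project/clean/dataset_c",
]


def make_transform(source_path):
    """Return a transform function bound to one explicit input path.

    In a real Foundry repository you would wrap `compute` with
    `@transform_df(Output(...), source=Input(source_path))` from
    `transforms.api`; that part is omitted here so the sketch runs
    anywhere.
    """
    def compute(source_df):
        # ... per-dataset logic goes here; identity for the sketch ...
        return source_df

    # Give each generated function a distinct, traceable name.
    compute.__name__ = "compute_" + source_path.rsplit("/", 1)[-1]
    return compute


# One generated transform per hardcoded path.
TRANSFORMS = [make_transform(p) for p in DATASET_PATHS]
```

Because the loop runs at module import time, the set of registered transforms is fully determined by the committed list, never by the contents of the directory at build time.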
Another tactic to achieve what you're looking for is to make a single long dataset that is the unpivoted version of the inputs. In this way, you could simply APPEND new rows / files to this dataset, which would let you accept arbitrary inputs, assuming your transform is constructed to handle this. My rule of thumb is this: if you need dynamic schemas or dynamic counts of datasets, then you're better off using dynamic files / row counts in a single dataset.
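To make the unpivot idea concrete, here is a plain-Python sketch of the melt step (the column names and sample rows are made up). In a real transform you would do the equivalent with PySpark (for example the `stack` SQL function), cast values to a common type such as string, and write the result with APPEND mode:

```python
def unpivot(rows, id_col):
    """Melt wide rows (one dict per row) into long (id, column, value)
    triples, so inputs with different schemas share one output shape."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            # In Spark you would cast `value` to string here so the
            # long dataset has a single stable schema.
            long_rows.append({id_col: row[id_col], "column": col, "value": value})
    return long_rows


# Hypothetical rows from two inputs with different schemas.
wide_a = [{"id": 1, "price": 9.5}]
wide_b = [{"id": 2, "qty": 3, "region": "EU"}]

# Both melt into the same three-column shape and can be appended
# to a single long dataset.
long_rows = unpivot(wide_a, "id") + unpivot(wide_b, "id")
```

The downstream transform then reads one dataset with a fixed `(id, column, value)` schema, no matter how many sources fed it.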