
In Foundry Code Repositories, how do I iterate over all datasets in a directory?

I'm trying to read (all or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path, without hardcoding individual datasets as inputs?

I'd like to dynamically fetch different columns from multiple datasets without having to hardcode individual input datasets.

This doesn't work, since you would get inconsistent results every time you run CI. It would also break TLLV (Transforms Level Logic Versioning) by making it impossible to tell when the logic has actually changed, and therefore when a dataset should be marked as stale.

You will have to write out the logical paths of each dataset you wish to transform, even if that means they are passed into a generated transform. There needs to be at least some consistent record of which datasets were targeted by which commit.
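For illustration, here is a minimal sketch of what such a generated transform could look like with the Python transforms API: the dataset paths are still spelled out explicitly in code, so every commit records exactly which datasets it targets. The paths, the output naming convention, and the selected columns below are hypothetical, not part of the original answer.

```python
from transforms.api import transform_df, Input, Output

# The targeted datasets are written out explicitly, so each commit keeps a
# consistent record of which datasets it transforms (hypothetical paths).
DATASET_PATHS = [
    "/Project/source_folder/dataset_a",
    "/Project/source_folder/dataset_b",
]


def make_transform(source_path):
    # One transform per source dataset; the output path is derived from the
    # input path using a hypothetical "_cleaned" naming convention.
    @transform_df(
        Output(source_path + "_cleaned"),
        source_df=Input(source_path),
    )
    def compute(source_df):
        # Hypothetical column selection; adapt per dataset as needed.
        return source_df.select("id", "value")

    return compute


# Module-level list of generated transforms, to be registered with the
# repository's pipeline (e.g. via add_transforms in pipeline.py).
TRANSFORMS = [make_transform(p) for p in DATASET_PATHS]
```

Because every generated transform still has static inputs and outputs, CI and TLLV behave exactly as if each transform had been written out by hand.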

Another tactic to achieve what you're looking for is to build a single long dataset that is the unpivoted version of your datasets. That way, you can simply APPEND new rows / files to this dataset, which lets you accept arbitrary inputs, assuming your transform is constructed to handle this. My rule of thumb is: if you need dynamic schemas or a dynamic number of datasets, you're better off using dynamic files / row counts within a single dataset.
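As a rough illustration of that unpivoted "long" layout, here is a small PySpark helper that melts a wide dataframe into (id, metric, value) rows; the column names and usage are hypothetical and only meant as a sketch.

```python
from pyspark.sql import DataFrame, functions as F


def unpivot(df: DataFrame, id_col: str, value_cols: list) -> DataFrame:
    """Melt a wide dataframe into long (id, metric, value) rows."""
    # Build the SQL stack() expression: stack(n, 'col1', col1, 'col2', col2, ...)
    # Note: the value columns must share a compatible type (e.g. all numeric).
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return df.select(
        F.col(id_col),
        F.expr(f"stack({len(value_cols)}, {pairs}) as (metric, value)"),
    )


# Hypothetical usage: every source contributes rows in the same narrow schema,
# so new sources only add rows / files instead of new input datasets.
# long_df = unpivot(wide_df, "id", ["temperature", "humidity"])
```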


