How do I ensure consistent file sizes in datasets built in Foundry Python Transforms?

Original post: 2021-12-08 10:56:59. Tags: palantir-foundry, foundry-code-repositories, foundry-python-transform

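My Foundry transforms produce different amounts of data on different runs, but I want each file to contain a similar number of rows. I can call DataFrame.count() and then coalesce/repartition, but that requires computing the full dataset and then either caching it or recomputing it. Is there a way for Spark to take care of this?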
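For reference, the count-then-repartition approach described above comes down to one piece of arithmetic. The following is a minimal sketch of it; the helper name `files_needed` and the target of 1,000,000 rows per file are illustrative assumptions, not part of the original question.

```python
def files_needed(row_count: int, target_rows_per_file: int) -> int:
    """Return how many files are needed so that, with rows spread evenly,
    no file exceeds target_rows_per_file. (Hypothetical helper.)"""
    if target_rows_per_file <= 0:
        raise ValueError("target_rows_per_file must be positive")
    # Integer ceiling division; always produce at least one file.
    return max(1, (row_count + target_rows_per_file - 1) // target_rows_per_file)


print(files_needed(2_500_000, 1_000_000))  # 3
```

In a transform this would be used as something like `df.repartition(files_needed(df.count(), 1_000_000))`. The `df.count()` call is what forces the extra full computation that the question is trying to avoid.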

2 Answers

You can use the spark.sql.files.maxRecordsPerFile configuration option by setting it per output of your @transform:

output.write_dataframe(
    output_df,
    options={"maxRecordsPerFile": "1000000"},
)
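To see what this option does to file sizes, here is a small pure-Python model, not Spark code. It assumes the cap is applied independently within each partition being written, so oversized partitions are split but small partitions are not merged. The function names are hypothetical.

```python
from typing import List


def split_partition(rows_in_partition: int, max_records_per_file: int) -> List[int]:
    """Row counts of the files one partition would produce under the cap."""
    full_files, remainder = divmod(rows_in_partition, max_records_per_file)
    return [max_records_per_file] * full_files + ([remainder] if remainder else [])


def simulate_write(partition_sizes: List[int], max_records_per_file: int) -> List[int]:
    """Row counts of all output files, one partition after another."""
    return [
        rows
        for size in partition_sizes
        for rows in split_partition(size, max_records_per_file)
    ]


print(simulate_write([2_500_000, 400_000], 1_000_000))
# [1000000, 1000000, 500000, 400000]
```

Under this model the option puts an upper bound on rows per file but does not guarantee uniform files: the 400,000-row partition still yields a small file of its own.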

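proggeo's answer is useful if the only thing you care about is the number of records per file. However, it is sometimes useful to bucket your data so that Foundry can optimize downstream operations such as Contour analyses or other transforms.

In those cases you can use something like the following: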

bucket_column = 'equipment_number'
num_files = 8
output_df = output_df.repartition(num_files, bucket_column)
output.write_dataframe(
    output_df,
    bucket_cols=[bucket_column],
    bucket_count=num_files,
)

Provided your bucket column is well distributed, this will help keep the number of rows in each dataset file similar.
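The "well distributed" condition can be illustrated with a rough pure-Python simulation of hash bucketing. This is only a sketch: Spark uses its own Murmur3-based hash, and `zlib.crc32` stands in for it here, so the exact counts will differ from a real job. The key values and the helper `rows_per_bucket` are made up for the example.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def rows_per_bucket(keys: Iterable[str], num_buckets: int) -> List[int]:
    """Count rows landing in each bucket when assigned by hash(key) % num_buckets.
    crc32 is a stand-in for Spark's hash function (assumption)."""
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_buckets for k in keys)
    return [counts.get(bucket, 0) for bucket in range(num_buckets)]


# 1,000 distinct equipment numbers, 80 rows each: a well-distributed column.
even_keys = [f"EQ-{i % 1000}" for i in range(80_000)]
# The same volume of rows, but one equipment number owns most of them.
skewed_keys = ["EQ-0"] * 60_000 + [f"EQ-{i % 1000}" for i in range(20_000)]

even = rows_per_bucket(even_keys, 8)
skewed = rows_per_bucket(skewed_keys, 8)
print("even:  ", even)
print("skewed:", skewed)
```

With the even keys the eight buckets come out at broadly similar sizes, while with the skewed keys one bucket holds at least the 60,000 rows of the dominant key. All rows sharing a key always land in the same file, so a heavily skewed bucket column defeats the goal of similar file sizes.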

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the address of the original post.
