
Foundry code workbooks are too slow, how to iterate faster?

I've noticed that code workbooks are too slow when querying from tables. They are much slower than running SQL against a data warehouse. What is the correct workflow to quickly pull and join data for iterative analysis?

As I hinted in the comments, this is very hard to answer because code workbooks were designed for interactivity, so they are normally very fast. That doesn't mean there can't be reasons for them to become slower. I'll list some here; maybe they can help you speed things up:

  • Running code workbooks straight from raw datasets can be slow. Check how many files back a particular dataset and what type they are. In raw, these may be CSV files rather than snappy/parquet, which would make your compute faster. CSV inputs also lead code workbooks to try to infer the schema every time you iterate. Adding a simple raw -> clean transform in a PySpark code repository may help a ton here.

  • Your dataset may be poorly optimized, with too many files for the data size. This causes code workbooks to spend a lot of time hitting disk to open each file. You can verify this by going to the dataset details tab -> files and checking the size of your files. It may be worth adding a repartition to your clean step (same as above). This is Spark behavior, not something specific to Foundry; read more in "Is it better to have one large parquet file or lots of smaller parquet files?"

  • Your organization may not have enough resources for your compute, or too many people may be using code workbooks at the same time for whatever quota you have set up. This is something you'll need to check with your platform team or support channels.

  • Use AQE and local mode; see "How do I get better performance in my Palantir Foundry transformation when my data scale is small?"

  • If you are using Python: avoid UDFs. They can make your code particularly slow, especially if you are comparing against SQL. PySpark UDFs are notoriously slow; see "Spark functions vs UDF performance?"
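
To make the "too many files for the data size" point concrete, here is a rough sketch in plain Python of how one might pick a partition count for the repartition in the clean step. The 128 MB target file size is an assumed rule of thumb rather than a Foundry setting, and `summarize_files` and `suggest_partition_count` are hypothetical helper names; the file sizes would be read off the dataset details tab -> files.

```python
import math

# Assumption: aim for files of roughly 128 MB each. Tune this for your data.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def summarize_files(file_sizes_bytes):
    """Return (file_count, total_bytes, mean_file_bytes) for a list of file sizes."""
    count = len(file_sizes_bytes)
    total = sum(file_sizes_bytes)
    mean = total / count if count else 0.0
    return count, total, mean


def suggest_partition_count(file_sizes_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Suggest how many partitions would give files close to the target size."""
    total = sum(file_sizes_bytes)
    # Always keep at least one partition, even for an empty dataset.
    return max(1, math.ceil(total / target_file_bytes))


# Hypothetical dataset: 2,000 files of 1 MB each (many small files).
sizes = [1024 * 1024] * 2000
count, total, mean = summarize_files(sizes)
n = suggest_partition_count(sizes)
print(f"{count} files, mean size {mean / 1024 / 1024:.1f} MB, suggested partitions: {n}")
```

In the clean transform, the suggested count would then be passed to the Spark DataFrame, for example `df.repartition(n)`, so the output dataset is backed by a handful of larger files instead of thousands of tiny ones.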

"What is the correct workflow to quickly pull and join data for iterating analysis?"

For quick one-off analysis I would recommend using the Foundry JDBC/ODBC Driver (installed on your local computer) to query the Foundry SQL Server. Note that this only works for moderate result-set sizes and low query complexity.

This will give you turnaround times of seconds instead of minutes on your queries.

