Foundry code workbooks are too slow, how to iterate faster?
I've noticed that code workbooks are very slow when querying from tables, much slower than using SQL against a data warehouse. What is the correct workflow to quickly pull and join data for iterative analysis?
As I hinted in the comment, this is hard to answer because code workbooks were designed for interactivity, so they are normally very fast. That doesn't mean there can't be reasons for them to become slower. I'll list some here; maybe they can help you speed things up:
Building code workbooks straight on top of raw datasets can be slow. Check how many files back a particular dataset and what type they are: in raw these may be CSV files rather than snappy-compressed Parquet, which would make your compute faster. CSV inputs also force code workbooks to re-infer the schema every time you iterate. Adding a simple raw -> clean transform in a PySpark code repository may help a ton here.
Your dataset may be poorly optimized, with too many files for its size. This forces code workbooks to spend a lot of time hitting disk to open each file. You can verify this by going to the dataset's Details tab -> Files and checking the size of your files. It may be worth adding a repartition in your clean step (same as above). This is Spark behavior, not Foundry-specific; read more here: Is it better to have one large parquet file or lots of smaller parquet files?
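To pick a sensible partition count for that repartition, divide the dataset's size by a target file size; roughly 128 MB per Parquet file is a common Spark rule of thumb. The helper below is a sketch (the name `target_partitions` is mine, not a Foundry or Spark API):

```python
import math


def target_partitions(total_bytes, target_file_mb=128):
    # Aim for roughly target_file_mb per output file; never go below one.
    return max(1, math.ceil(total_bytes / (target_file_mb * 1024 * 1024)))


# A 10 GB dataset at ~128 MB per file works out to 80 partitions,
# which you would pass to df.repartition(...) in the clean step.
```

The dataset's total size is exactly what the Details tab -> Files view shows you, so the check and the fix use the same number.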
Your organization may not have enough resources allocated for your compute, or too many people may be using code workbooks at the same time under whatever quota was set up. This is something you'll need to check with your platform team or support channels.
Consider using AQE and local mode: How do I get better performance in my Palantir Foundry transformation when my data scale is small?
If you are using Python: avoid UDFs, as these can make your code particularly slow, especially if you are comparing against SQL. PySpark UDFs are notoriously slow: Spark functions vs UDF performance?
"What is the correct workflow to quickly pull and join data for iterating analysis?"
For quick one-off analysis I would recommend using the Foundry JDBC/ODBC driver (installed on your local computer) to query the Foundry SQL server. Note that this will only work for moderate result-set sizes and low query complexity. It will give you turnaround times of seconds instead of minutes on your queries.