
Efficient reading/transforming partitioned data in delta lake

I have my data in a Delta Lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and Z-ordered by storeIdNum, where there are about 10 store ID numbers, each with a few million rows per date. Sometimes I read a single date partition (~20 million rows) and sometimes I read a whole month or year of data to do a batch operation. I have a second, much smaller table with around 75,000 rows per date that is also Z-ordered by storeIdNum, and most of my operations involve joining the larger table to the smaller table on storeIdNum (and various other fields, such as a time window; the smaller table is a roll-up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window and partition by with lag/lead/avg/dense_rank functions, etc.).

My question is: should I include the date in all of the joins, group by, and partition by statements? Whenever I read one date of data, I always include the year and date in the statement that reads the data, since I know I only want to read from a certain partition (or a year of partitions), but is it also important to reference the partition column in windows and group bys for efficiency, or is that redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.

For example:

dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")

Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (and then whatever timestamp/customer ID fields I need to join on)? I definitely need STORE_ID_NUM, but I could forego YEAR and DATE if they just add another column and make the join less efficient because there is more to join on. I don't know exactly how it works, so I wanted to check: by foregoing them in the join, maybe I am making things less efficient because I am not utilizing the partitions when doing the operations? Thank you!
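
To make the question concrete, here is a rough sketch (not from the original post) of what the join and aggregation could look like with the partition columns included; it assumes the dfBig and dfSmall DataFrames defined above and uses the column names from the question (SOME_METRIC is a placeholder, not a real column from the tables):

from pyspark.sql import functions as F

# Including the partition columns (YEAR, DATE) in the join keys lets Spark match
# partitions directly and, with dynamic partition pruning, skip files on the big
# side that cannot contain matching rows. STORE_ID_NUM is the Z-ordered key.
joined = dfBig.join(
    dfSmall,
    on=["YEAR", "DATE", "STORE_ID_NUM"],
    how="inner",
)

# Keeping DATE in the grouping keys is essentially free when only one date was read
# (it is constant there), and it keeps the aggregation correct when a month or year
# of data is read instead.
hourly = joined.groupBy("YEAR", "DATE", "STORE_ID_NUM", "TS_HR").agg(
    F.avg("SOME_METRIC").alias("avg_metric")
)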

The key with Delta is to choose the partition columns well, and this can take some trial and error. If you want to optimize response performance, one technique I learned is to choose a filter column with low cardinality (if the problem is a time series, it will be the date; on the other hand, if it is a report for all clients, it may be convenient to choose the city). Remember that if you work with Delta, each partition represents a level of the file structure, whose cardinality will be the number of directories.

In your case I find it good to partition by YEAR, but given the number of records I would add MONTH, which would help somewhat with Spark's dynamic partition pruning.
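
As an illustration only (the MONTH derivation and the output path are assumptions, not part of the original answer), writing the result to a new Delta table partitioned by YEAR and MONTH could look like this:

from pyspark.sql import functions as F

# Derive a MONTH column from the DATE value so the output can be partitioned by it.
result = joined.withColumn("MONTH", F.month(F.to_date("DATE")))

# Write to a new Delta table partitioned by YEAR and MONTH; the path is hypothetical.
(result.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("YEAR", "MONTH")
    .save("/mnt/lake/store_data_agg"))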

Another thing you can try is a BROADCAST JOIN if one table is very small compared to the other.

Broadcast Hash Join in Spark (ES)

Join Strategy Hints for SQL Queries
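
A minimal sketch of the broadcast hint in the DataFrame API, assuming the dfBig and dfSmall DataFrames from the question (Spark also broadcasts automatically when the small side is below spark.sql.autoBroadcastJoinThreshold):

from pyspark.sql.functions import broadcast

# Explicitly broadcasting the small hourly table turns the join into a broadcast
# hash join, so the large per-second table is not shuffled across the cluster.
joined = dfBig.join(
    broadcast(dfSmall),
    on=["YEAR", "DATE", "STORE_ID_NUM"],
    how="inner",
)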

The latter link explains how dynamic pruning helps in MERGE operations.

How to improve performance of Delta Lake MERGE INTO queries using partition pruning
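
For completeness, a hedged sketch of the pattern that the linked article describes, with an explicit partition predicate in the ON clause so only the touched partitions of the target are scanned. The table name STORE_DATA_AGG and the source view "updates" are hypothetical, and the original poster is writing to a new table rather than merging, so this is purely illustrative:

spark.sql("""
    MERGE INTO STORE_DATA_AGG AS t
    USING updates AS s
    ON t.YEAR = s.YEAR
       AND t.DATE = s.DATE
       AND t.STORE_ID_NUM = s.STORE_ID_NUM
       AND t.DATE = '2020-11-12'  -- literal predicate on the partition column enables pruning
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")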
