
partitionBy & overwrite strategy in an Azure DataLake using PySpark in Databricks

I have a simple ETL process in an Azure environment:

blob storage > Data Factory > data lake raw > Databricks > data lake curated > data warehouse (main ETL).

The datasets for this project are not very big (roughly 1 million rows and 20 columns, give or take), but I would like to keep them properly partitioned in my data lake as Parquet files.

Currently I run some simple logic to figure out where in my lake each file should sit, based on business calendars.

The files look roughly like this:

Year Week Data
2019 01   XXX
2019 02   XXX

I then partition a given file into the following layout, replacing data that already exists and creating new folders for new data:

curated ---
           dataset --
                     Year 2019 
                              - Week 01 - file.pq + metadata
                              - Week 02 - file.pq + metadata
                              - Week 03 - file.pq + metadata #(pre-existing file)

The metadata files are the automatically generated success and commit markers.

To this end I use the following write in PySpark 2.4.3:

pyspark_dataframe.write.mode('overwrite') \
    .partitionBy('Year', 'Week') \
    .parquet('/curated/dataset')

Now, if I use this command on its own it will overwrite all existing data under the target path,

so Week 03 will be lost.

Using spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") seems to fix this and only overwrites the target partitions, but I wonder whether this is the best way to handle files in my data lake.
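For clarity, this is the combination I mean (a sketch using the same DataFrame and a placeholder path as above):

# With dynamic mode, only the partitions present in the incoming DataFrame
# are replaced; untouched partitions such as Week 03 are left alone.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

pyspark_dataframe.write.mode('overwrite') \
    .partitionBy('Year', 'Week') \
    .parquet('/curated/dataset')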

I have also found it hard to find any documentation on this feature.

My first instinct was to loop over the data and write each partition out manually, which would give me greater control, but looping will be slow.

My next thought was to write each partition to a /tmp folder, move each Parquet file, and then replace or create files as needed using the query above, then purge the /tmp folder whilst creating some sort of metadata log.
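A rough, untested sketch of that staging idea, assuming the Databricks dbutils file utilities are available and using placeholder paths:

# Stage the freshly computed partitions in a temporary location first
pyspark_dataframe.write.mode('overwrite') \
    .partitionBy('Year', 'Week') \
    .parquet('/tmp/staging/dataset')

# Move only the partitions that were actually written into the curated area
for row in pyspark_dataframe.select('Year', 'Week').distinct().collect():
    src = "/tmp/staging/dataset/Year={0}/Week={1}".format(row['Year'], row['Week'])
    dst = "/curated/dataset/Year={0}/Week={1}".format(row['Year'], row['Week'])
    dbutils.fs.rm(dst, True)    # drop the partition folder if it already exists
    dbutils.fs.mv(src, dst, True)

# Clean up the staging area
dbutils.fs.rm('/tmp/staging/dataset', True)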

Is there a better way or method to do this?

Any guidance would be much appreciated.

The end goal here is to have a clean and safe area for all 'curated' data, whilst having a log of Parquet files I can read into a data warehouse for further ETL.

I saw that you are using Databricks in the Azure stack. I think the most viable and recommended method would be to make use of the new Delta Lake project in Databricks.

It provides options for various upserts, merges and ACID transactions to object stores like S3 or Azure Data Lake Storage. It essentially brings the management, safety, isolation and upserts/merges that data warehouses provide to data lakes. For one pipeline, Apple actually replaced its data warehouse to run solely on Delta in Databricks because of its functionality and flexibility. For your use case, and for many others who use Parquet, it is just a simple change of replacing 'parquet' with 'delta' in order to use its functionality (if you have Databricks). Delta is basically a natural evolution of Parquet, and Databricks has done a great job by adding functionality and open-sourcing it.

For your case, I would suggest you try the replaceWhere option provided in Delta. Before making this targeted update, the target table has to be of format delta.

Instead of this:

dataset.repartition(1).write.mode('overwrite') \
    .partitionBy('Year', 'Week') \
    .parquet('/curated/dataset')

From https://docs.databricks.com/delta/delta-batch.html:

'You can selectively overwrite only the data that matches predicates over partition columns.'

You could try this:

# The replaceWhere predicate limits the overwrite to Weeks 01 and 02,
# so the pre-existing Week 03 partition is left untouched.
dataset.repartition(1).write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy('Year', 'Week') \
    .option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'") \
    .save("/curated/dataset")

Also, if you wish to bring the number of output files down to one, why don't you use coalesce(1), as it will avoid a full shuffle.
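For example, a sketch that swaps the repartition(1) above for coalesce(1) (same placeholder path and predicate):

# coalesce(1) reduces the number of output files without a full shuffle
dataset.coalesce(1).write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy('Year', 'Week') \
    .option("replaceWhere", "Year == '2019' AND Week >= '01' AND Week <= '02'") \
    .save("/curated/dataset")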

From https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/:

'replaceWhere is particularly useful when you have to run a computationally expensive algorithm, but only on certain partitions.'

Therefore, I personally think that using replaceWhere to manually specify your overwrite will be more targeted and computationally efficient than just relying on spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic").

Databricks provides optimizations on Delta tables that make them a faster and much more efficient option than Parquet (hence the natural evolution), through bin-packing and Z-ordering:

From https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html:

  • WHERE (bin-packing)

'Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.'

  • ZORDER BY

'Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read.' (A sketch of OPTIMIZE with both options follows this list.)

  • Faster query execution with indexing, statistics, and auto-caching support

  • Data reliability with rich schema validation and transactional guarantees

  • Simplified data pipelines with flexible UPSERT support and unified Structured Streaming + batch processing on a single data source
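As a sketch of the two OPTIMIZE options quoted above, assuming the curated data is a Delta table at the placeholder path (the predicate must use partition columns, and Z-ordering is shown on the non-partition Data column from the question, since partition columns themselves cannot be Z-ordered; quoting of the partition value depends on its type):

# Bin-pack only the 2019 partitions and cluster the files by the Data column
spark.sql("""
    OPTIMIZE delta.`/curated/dataset`
    WHERE Year = '2019'
    ZORDER BY (Data)
""")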

You can also check out the complete documentation of the open source project: https://docs.delta.io/latest/index.html

I also want to say that I do not work for Databricks/Delta Lake; I have just seen their improvements and functionality benefit me in my work.

UPDATE:

The gist of the question is 'replacing data that exists and creating new folders for new data', and doing it in a highly scalable and effective manner.

Using dynamic partition overwrite on Parquet does the job, but I feel the natural evolution of that method is to use Delta table merge operations, which were basically created to 'integrate data from Spark DataFrames into the Delta Lake'. They provide you with extra functionality and optimizations for merging your data based on how you want that to happen, and they keep a log of all actions on a table, so you can roll back to earlier versions if needed.

Delta Lake Python API (for merge): https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaMergeBuilder

Databricks optimization: https://kb.databricks.com/delta/delta-merge-into.html#discussion

Using a single merge operation you can specify the condition to merge on (in this case it could be a combination of the year, the week and an id). If the records match (meaning they exist in both your Spark DataFrame and the Delta table, e.g. Week 01 and Week 02), they are updated with the data from your Spark DataFrame, and other records are left unchanged:

# You can also pass an extra condition for matched records, but it is not required
.whenMatchedUpdateAll(condition=None)

In some cases, if nothing matches, you might want to insert and create new rows and partitions; for that you can use:

.whenNotMatchedInsertAll(condition=None)

You can use the convertToDelta operation (https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.convertToDelta) to convert your Parquet table to a Delta table, so that you can perform delta operations on it using the API.

'You can now convert a Parquet table in place to a Delta Lake table without rewriting any of the data. This is great for converting very large Parquet tables which would be costly to rewrite as a Delta table. Furthermore, this process is reversible.'

Your merge case (replacing data where it exists and creating new records where it does not) could go like this:

(I have not tested this; refer to the examples and the API docs for the exact syntax.)

%python
from delta.tables import DeltaTable

# Convert the existing Parquet dataset in place to a Delta table.
# For partitioned data the partition schema must be supplied; adjust types as needed.
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`/curated/dataset`", "Year string, Week string")

# Upsert: update rows whose Year/Week match, insert everything else
deltaTable.alias("target").merge(
        dataset.alias("source"),
        "target.Year = source.Year AND target.Week = source.Week") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

If the Delta table is partitioned correctly (Year, Week) and you used the whenMatched clause correctly, these operations will be highly optimized and could take seconds in your case. It also gives you consistency, atomicity and data integrity, with the option to roll back.

Some more functionality it provides: you can specify the set of columns to update when a match is made (if you only need to update certain columns). You can also enable spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true"), so that Delta uses the minimal set of targeted partitions to carry out the merge (update, delete, create).
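As a rough sketch of both, reusing the deltaTable and dataset from the example above and updating only the Data column from the question's schema (untested; check the DeltaMergeBuilder docs for the exact syntax):

# Let Delta prune to the minimal set of target partitions during the merge
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning", "true")

# Update only the Data column on matches; insert full rows where nothing matches
deltaTable.alias("target").merge(
        dataset.alias("source"),
        "target.Year = source.Year AND target.Week = source.Week") \
    .whenMatchedUpdate(set={"Data": "source.Data"}) \
    .whenNotMatchedInsertAll() \
    .execute()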

Overall, I think using this approach is a very new and innovative way of carrying out targeted updates, as it gives you more control while keeping the operations highly efficient. Using Parquet with dynamic partition overwrite mode will also work fine; however, Delta Lake features bring data quality to your data lake that is unmatched.

My recommendation: for now, use dynamic partition overwrite mode for Parquet files to do your updates. You could experiment with the delta merge on just one table, using the Databricks optimization spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true") and .whenMatchedUpdateAll(), and compare the performance of both (your files are small, so I do not think the difference will be big). The Databricks partition-pruning optimization for merges article came out in February, so it is really new and could be a game changer for the overhead that delta merge operations incur (under the hood they just create new files, but partition pruning could speed it up).

Merge examples in Python, Scala and SQL: https://docs.databricks.com/delta/delta-update.html#merge-examples

https://databricks.com/blog/2019/10/03/simple-reliable-upserts-and-deletes-on-delta-lake-tables-using-python-apis.html

Instead of writing the table directly, we can use saveAsTable with append and remove the relevant partitions before that.

dataset.repartition(1).write.mode('append')\
                     .partitionBy('Year','Week').saveAsTable("tablename")

For removing the previous partitions:

partitions = [(x["Year"], x["Week"]) for x in dataset.select("Year", "Week").distinct().collect()]
for year, week in partitions:
    spark.sql('ALTER TABLE tablename DROP IF EXISTS PARTITION (Year = "{0}", Week = "{1}")'.format(year, week))

Correct me if I missed something crucial in your approach, but it seems like you want to write new data on top of the existing data, which is normally done with

write.mode('append')

instead of 'overwrite'.

If you want to keep the data separated by batch, so you can select it for upload to the data warehouse or for audit, there is no sensible way to do it besides including this information in the dataset and partitioning by it during save, e.g.

dataset.write.mode('append') \
    .partitionBy('Year', 'Week', 'BatchTimeStamp') \
    .parquet('curated/dataset')

Any other manual intervention in the Parquet file format will be hacky at best; at worst you risk making your pipeline unreliable or corrupting your data.

Delta Lake, which Mohammad mentions, is also a good suggestion overall for reliably storing data in data lakes, and a golden industry standard right now. For your specific use case you could use its feature of making historical queries (append everything and then query for the difference between the current dataset and the state after the previous batch). However, the audit log is limited in time by how you configure your Delta Lake and can be as low as 7 days, so if you want full information in the long term, you need to follow the approach of saving batch information anyway.
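A sketch of that kind of historical query, assuming the curated data has been converted to a Delta table at the placeholder path (the version number here is illustrative):

# Inspect the table history to find the version written by the previous batch
spark.sql("DESCRIBE HISTORY delta.`/curated/dataset`").show(truncate=False)

# Read an earlier version and diff it against the current state
previous = spark.read.format("delta").option("versionAsOf", 5).load("/curated/dataset")
current = spark.read.format("delta").load("/curated/dataset")
new_rows = current.subtract(previous)   # rows appended since version 5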

On a more strategic level, when following raw -> curated -> DW, you can also consider adding another 'hop' and putting your ready data into a 'preprocessed' folder, organized by batch, and then appending it to both the curated and DW sets.

As a side note, .repartition(1) doesn't make much sense when using Parquet, as Parquet is a multi-file format anyway, so the only effect this has is a negative impact on performance. But please do let me know if there is a specific reason you are using it.
