
Small files available in ADLS Gen2 location even after delta optimization

I have a comparatively big external Delta table whose data is stored in an ADLS Gen2 location. This table is partitioned by id and signal_date.

I am running a Delta OPTIMIZE query on this table on a weekly basis. The query is shown below.

[Screenshot: the weekly OPTIMIZE query and its run times]
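The screenshot of the query isn't reproduced here; as a rough, assumed sketch of what a weekly, partition-scoped OPTIMIZE of this kind usually looks like (the table name and the predicate values below are hypothetical placeholders, not taken from the actual job):

from pyspark.sql import SparkSession

# Sketch only: a Delta-enabled Spark session; on Databricks `spark` already exists
# with these settings applied.
spark = (SparkSession.builder
         .appName("weekly-optimize-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Partition-scoped OPTIMIZE; `my_db.signal_table`, the id and the date are placeholders.
spark.sql("""
    OPTIMIZE my_db.signal_table
    WHERE id = 'sensor_01' AND signal_date = '2022-04-11'
""")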

For a few partitions, the optimization runs for more than 2 hours, as highlighted above.

And across all the partitions for a week, this job runs for more than 48 hours and has to be killed partway through in order to proceed with the data load into this table (otherwise I hit a concurrent operation failure error on the partition).

Even after this optimization, I can still see many small files in the ADLS location for that partition, as shown below.

[Screenshots: ADLS Gen2 file listing for the partition after OPTIMIZE, still showing many small files]

Is something going wrong during the optimization, given the huge execution time and the small files still present after optimization?

Any thoughts/pointers appreciated!

Thanks in advance.

OPTIMIZE doesn't remove the small files after it's finished - it creates a new table version that references the new, bigger files and marks the old small files as removed, but the actual removal only happens when you run VACUUM.
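As a minimal sketch of that follow-up step (assuming a Delta-enabled `spark` session as in the sketch above, and the same placeholder table name, neither of which comes from the question), physically deleting the tombstoned files would look roughly like this:

# Sketch only: `my_db.signal_table` is a hypothetical table name.
# VACUUM physically deletes files that the log has marked as removed and that are
# older than the retention threshold (default 7 days / 168 hours).
spark.sql("VACUUM my_db.signal_table")
# An explicit retention window can also be given, e.g.:
# spark.sql("VACUUM my_db.signal_table RETAIN 168 HOURS")

Note that shortening the retention below the default requires disabling Delta's retention duration safety check and removes the ability to time travel to those older versions.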

Here is an example of the transaction log produced by an OPTIMIZE run:

{"add":{"path":"part-00000-5046c283-8633-4fe2-8c99-ada26908e2d0-c000.snappy.parquet",
  "partitionValues":{},"size":1105,"modificationTime":1650281913487,
  "dataChange":false,"stats":"{\"numRecords\":200,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":99},\"nullCount\":{\"id\":0}}"}}
{"remove":{"path":"part-00002-57c4253a-5bdc-4d3e-9886-601fa793cdf6-c000.snappy.parquet",
  "deletionTimestamp":1650281912032,"dataChange":false,
  "extendedFileMetadata":true,"partitionValues":{},"size":536}}
...other remove entries...
{"commitInfo":{"timestamp":1650281913514,"operation":"OPTIMIZE",
  "operationParameters":{"predicate":"[\"true\"]"},"readVersion":1,
  "isolationLevel":"SnapshotIsolation","isBlindAppend":false,
  "operationMetrics":{"numRemovedFiles":"16","numRemovedBytes":"8640","p25FileSize":"1105","minFileSize":"1105","numAddedFiles":"1","maxFileSize":"1105","p75FileSize":"1105","p50FileSize":"1105","numAddedBytes":"1105"},
  "engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0",
  "txnId":"31a68055-7932-4266-8290-9939af7e6a84"}}
