
Small files available in ADLS Gen2 location even after delta optimization

I have a fairly large external Delta table whose data is stored in an ADLS Gen2 location. The table is partitioned by id and signal_date.

I run a Delta OPTIMIZE query on this table on a weekly basis. The query is shown below.

[image]

For a few partitions, the optimization runs for more than 2 hours, as highlighted above.

For all of a week's partitions, the job runs for more than 48 hours, and I have to kill it partway through in order to proceed with the data load into this table (otherwise the load fails with a concurrent operation error on the partition).

Even after this optimization, I can still see many small files in the ADLS location for that partition, as shown below.

[image] [image]

Is something going wrong during the optimization, given the very long execution time and the small files that remain afterwards?

Any thoughts/points appreciated!

Thanks In Advance.

OPTIMIZE doesn't remove the small files when it finishes. It creates a new table version that refers to new, bigger files and marks the old small files as removed, but the files are physically deleted only when you run VACUUM.
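As a rough sketch of that two-step flow, the helpers below build the OPTIMIZE and VACUUM statements and run them through a Spark session. The table name `events`, the predicate value, and the helper names are hypothetical, not taken from the question. The sketch assumes a Delta Lake version whose OPTIMIZE accepts a WHERE clause on partition columns, and it uses 168 hours for the retention because that matches the usual 7-day default. Restricting OPTIMIZE to recently loaded partitions may also shorten the weekly run, though that depends on the workload.

```python
from typing import Optional


def optimize_sql(table: str, partition_predicate: Optional[str] = None) -> str:
    """Build an OPTIMIZE statement, optionally limited to some partitions."""
    stmt = f"OPTIMIZE {table}"
    if partition_predicate:
        # The predicate may only reference partition columns.
        stmt += f" WHERE {partition_predicate}"
    return stmt


def vacuum_sql(table: str, retain_hours: int = 168) -> str:
    """Build a VACUUM statement; 168 hours corresponds to 7 days."""
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


def compact_and_clean(spark, table: str,
                      partition_predicate: Optional[str] = None,
                      retain_hours: int = 168) -> None:
    """Compact small files, then physically delete unreferenced old files.

    `spark` is assumed to be an existing SparkSession with Delta Lake enabled.
    """
    spark.sql(optimize_sql(table, partition_predicate))
    spark.sql(vacuum_sql(table, retain_hours))


# Hypothetical table name and date, for illustration only.
print(optimize_sql("events", "signal_date >= '2022-04-11'"))
print(vacuum_sql("events"))
```

Note that VACUUM only deletes files that have been unreferenced for longer than the retention period, so small files replaced by a recent OPTIMIZE will remain in storage until that period has passed.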

Here is an example of the transaction log entry for an OPTIMIZE operation:

{"add":{"path":"part-00000-5046c283-8633-4fe2-8c99-ada26908e2d0-c000.snappy.parquet",
  "partitionValues":{},"size":1105,"modificationTime":1650281913487,
  "dataChange":false,"stats":"{\"numRecords\":200,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":99},\"nullCount\":{\"id\":0}}"}}
{"remove":{"path":"part-00002-57c4253a-5bdc-4d3e-9886-601fa793cdf6-c000.snappy.parquet",
  "deletionTimestamp":1650281912032,"dataChange":false,
  "extendedFileMetadata":true,"partitionValues":{},"size":536}}
...other remove entries...
{"commitInfo":{"timestamp":1650281913514,"operation":"OPTIMIZE",
  "operationParameters":{"predicate":"[\"true\"]"},"readVersion":1,
  "isolationLevel":"SnapshotIsolation","isBlindAppend":false,
  "operationMetrics":{"numRemovedFiles":"16","numRemovedBytes":"8640","p25FileSize":"1105","minFileSize":"1105","numAddedFiles":"1","maxFileSize":"1105","p75FileSize":"1105","p50FileSize":"1105","numAddedBytes":"1105"},
  "engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0",
  "txnId":"31a68055-7932-4266-8290-9939af7e6a84"}}
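To make the effect of those add and remove actions concrete, here is a minimal sketch of how a reader could work out which files the current table version still references. It is a simplification that replays actions in order and ignores checkpoints and other action types. The file names in the sample log are hypothetical and shortened.

```python
import json


def live_files(log_lines):
    """Replay add/remove actions in order and return the set of paths
    that the latest table version still references."""
    live = set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live


# Hypothetical log: two small files are written, then OPTIMIZE compacts them.
log = [
    '{"add":{"path":"small-1.parquet","size":536,"dataChange":true}}',
    '{"add":{"path":"small-2.parquet","size":536,"dataChange":true}}',
    '{"add":{"path":"big-1.parquet","size":1105,"dataChange":false}}',
    '{"remove":{"path":"small-1.parquet","dataChange":false}}',
    '{"remove":{"path":"small-2.parquet","dataChange":false}}',
]

print(live_files(log))  # only big-1.parquet is still referenced
```

The two small files are no longer part of the table, yet nothing in the log deletes them from storage, which is why they remain visible in the ADLS location until VACUUM runs.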
