
Small files available in ADLS Gen2 location even after delta optimization

I have a comparatively big external Delta table whose data is stored in an ADLS Gen2 location. This table is partitioned by id and signal_date.

I am running a Delta OPTIMIZE query on this table on a weekly basis. The query is shown below.

[Screenshot: the weekly OPTIMIZE query and its run times]
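The screenshot of the query isn't reproduced here; as a rough, assumed sketch of what a weekly, partition-scoped OPTIMIZE of this kind usually looks like (the table name and the predicate values below are hypothetical placeholders, not taken from the actual job):

from pyspark.sql import SparkSession

# Sketch only: a Delta-enabled Spark session; on Databricks `spark` already exists
# with these settings applied.
spark = (SparkSession.builder
         .appName("weekly-optimize-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Partition-scoped OPTIMIZE; `my_db.signal_table`, the id and the date are placeholders.
spark.sql("""
    OPTIMIZE my_db.signal_table
    WHERE id = 'sensor_01' AND signal_date = '2022-04-11'
""")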

For a few partitions, the optimization runs for more than 2 hours, as highlighted above.

And across all the partitions for a week, this job runs for more than 48 hours and has to be killed partway through in order to proceed with the data load into this table (otherwise I hit a concurrent operation failure error on the partition).

Even after this optimization, I can still see many small files in the ADLS location for that partition, as shown below.

[Screenshots: ADLS Gen2 file listing for the partition after OPTIMIZE, still showing many small files]

Is something going wrong during the optimization, given the huge execution time and the small files still present after optimization?

Any thoughts/pointers appreciated!

Thanks in advance.

OPTIMIZE doesn't remove the small files after it's finished - it creates a new table version that references the new, bigger files and marks the old small files as removed, but the actual removal only happens when you run VACUUM.
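As a minimal sketch of that follow-up step (assuming a Delta-enabled `spark` session as in the sketch above, and the same placeholder table name, neither of which comes from the question), physically deleting the tombstoned files would look roughly like this:

# Sketch only: `my_db.signal_table` is a hypothetical table name.
# VACUUM physically deletes files that the log has marked as removed and that are
# older than the retention threshold (default 7 days / 168 hours).
spark.sql("VACUUM my_db.signal_table")
# An explicit retention window can also be given, e.g.:
# spark.sql("VACUUM my_db.signal_table RETAIN 168 HOURS")

Note that shortening the retention below the default requires disabling Delta's retention duration safety check and removes the ability to time travel to those older versions.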

Here is an example of the transaction log produced by an OPTIMIZE run:

{"add":{"path":"part-00000-5046c283-8633-4fe2-8c99-ada26908e2d0-c000.snappy.parquet",
  "partitionValues":{},"size":1105,"modificationTime":1650281913487,
  "dataChange":false,"stats":"{\"numRecords\":200,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":99},\"nullCount\":{\"id\":0}}"}}
{"remove":{"path":"part-00002-57c4253a-5bdc-4d3e-9886-601fa793cdf6-c000.snappy.parquet",
  "deletionTimestamp":1650281912032,"dataChange":false,
  "extendedFileMetadata":true,"partitionValues":{},"size":536}}
...other remove entries...
{"commitInfo":{"timestamp":1650281913514,"operation":"OPTIMIZE",
  "operationParameters":{"predicate":"[\"true\"]"},"readVersion":1,
  "isolationLevel":"SnapshotIsolation","isBlindAppend":false,
  "operationMetrics":{"numRemovedFiles":"16","numRemovedBytes":"8640","p25FileSize":"1105","minFileSize":"1105","numAddedFiles":"1","maxFileSize":"1105","p75FileSize":"1105","p50FileSize":"1105","numAddedBytes":"1105"},
  "engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0",
  "txnId":"31a68055-7932-4266-8290-9939af7e6a84"}}
