
Reduce Partitions with COALESCE in Spark SQL

I run Spark SQL queries to perform data transformations and then store the final result set (after a series of transformation steps) to S3.

I recently noticed that one of my jobs was creating a large number of partition files while writing to S3 and taking a long time to complete (in fact, it was failing). So I want to know if there is any way to do a COALESCE-like operation in the SQL API to reduce the number of partitions before writing to S3.

I know that the SQL API equivalent of repartition is CLUSTER BY, so I was wondering if there is anything similar for the COALESCE operation in the SQL API (see the sketch below for how I use CLUSTER BY today).
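For reference, this is roughly how I use CLUSTER BY at the moment (the table and column names are just placeholders, not my real schema):

SELECT col
FROM TABLE1
CLUSTER BY col  -- shuffles rows into partitions by col, like repartition(col) in the DataFrame API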

Please note that I only have access to the SQL API, so my question strictly pertains to the Spark SQL API (e.g. SELECT col FROM TABLE1 WHERE ...).

We use Spark SQL 2.4.6.7

Thanks

The docs suggest using hints to coalesce partitions, e.g.:

SELECT /*+ COALESCE(3) */ col FROM TABLE1
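A fuller sketch of how this could look for a write, assuming an illustrative S3 path and filter (the hint applies to the SELECT feeding the write, so the job produces roughly 3 output files instead of one per upstream partition):

INSERT OVERWRITE DIRECTORY 's3://my-bucket/output/'  -- placeholder path
USING parquet
SELECT /*+ COALESCE(3) */ col
FROM TABLE1
WHERE col IS NOT NULL  -- placeholder filter

Both the COALESCE and REPARTITION hints were added in Spark 2.4, so they should be available on your version. COALESCE(n) merges existing partitions without a full shuffle, which is cheap but can leave unevenly sized files; REPARTITION(n) does a full shuffle and gives more evenly sized output at extra cost.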
