I have a requirement: a huge dataset is partitioned and inserted into Hive. To combine the output, I am using DF.coalesce(10). Now I want to write this partitioned data to a single directory. If I use DF.coalesce(1), will performance decrease? Or is there another way to do this?
From what I understand, you are trying to reduce the number of files per partition. With coalesce(10), you will get at most 10 files per partition. coalesce(1) would push the whole write through a single task, so on large data it is likely to be slower. I would suggest using repartition($"COL") instead, where COL is the column used to partition the data. This ensures that your "huge" data is split based on the partition column used in Hive:
df.repartition($"COL")
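As a rough sketch of how the repartition suggestion could fit into a full write, the snippet below repartitions by the partition column and then writes with partitionBy on the same column. The table names staging.events and warehouse.events_by_col are hypothetical, as are the Parquet format and overwrite mode; it also assumes Spark was built with Hive support. Because all rows with the same COL value land in the same Spark partition, each Hive partition directory should normally receive one file, though a heavily skewed COL value would still be written by a single task.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RepartitionBeforeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-before-write")
      .enableHiveSupport() // assumes the spark-hive module is on the classpath
      .getOrCreate()
    import spark.implicits._ // enables the $"COL" column syntax

    // Hypothetical source table; replace with your own DataFrame.
    val df = spark.table("staging.events")

    df.repartition($"COL")                 // shuffle so each COL value lives in one Spark partition
      .write
      .mode(SaveMode.Overwrite)            // assumption: replacing the target table is acceptable
      .format("parquet")                   // assumption: Parquet storage
      .partitionBy("COL")                  // one output directory per COL value
      .saveAsTable("warehouse.events_by_col") // hypothetical target table

    spark.stop()
  }
}
```

Compared with coalesce(1), this keeps the write parallel across COL values instead of funnelling everything through one task.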