
How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement: huge data is partitioned and inserted into Hive. To combine this data, I am using df.coalesce(10). Now I want to merge this partitioned data into a single directory. If I use df.coalesce(1), will performance decrease? Or is there another way to do this?
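For context, a minimal sketch of the pattern described above, assuming a SparkSession; the table and partition column names here are hypothetical, not from the original question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("coalesce-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source: the "huge" partitioned data.
    val df = spark.table("staging_events")

    // coalesce(10) caps the write at 10 tasks (and so at most 10 files
    // per Hive partition directory) without triggering a full shuffle.
    df.coalesce(10)
      .write
      .mode("overwrite")
      .partitionBy("dt")        // hypothetical partition column
      .saveAsTable("events")

Note that coalesce(1) funnels the entire dataset through a single task, so the final write stage runs on one executor core; that serialization is the usual performance cost of forcing a single output file.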

From what I understand, you are trying to ensure that there are fewer files per partition. By using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL") instead, where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive:

df.repartition($"COL")
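A runnable sketch of that suggestion, assuming a SparkSession spark, a DataFrame df, and a partition column COL; the table name is hypothetical:

    import spark.implicits._   // enables the $"COL" column syntax

    // repartition($"COL") shuffles rows so that all rows sharing a
    // COL value land in the same task, yielding one file per partition
    // value instead of many small files.
    df.repartition($"COL")
      .write
      .mode("overwrite")
      .partitionBy("COL")
      .saveAsTable("my_table") // hypothetical Hive table

Unlike coalesce, repartition performs a full shuffle, but it aligns Spark's task layout with the Hive partitioning, which is what keeps the per-partition file count down.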
