
How to coalesce large partitioned data into a single directory in Spark/Hive

I have a requirement: huge data is partitioned and inserted into Hive. To combine this data, I am using df.coalesce(10). Now I want to merge this partitioned data into a single directory. If I use df.coalesce(1), will performance decrease? Or is there another way to do this?
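For context, a minimal sketch of the pattern described above, assuming a SparkSession; the table and partition column names here are hypothetical, not from the original question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("coalesce-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source: the "huge" partitioned data.
    val df = spark.table("staging_events")

    // coalesce(10) caps the write at 10 tasks (and so at most 10 files
    // per Hive partition directory) without triggering a full shuffle.
    df.coalesce(10)
      .write
      .mode("overwrite")
      .partitionBy("dt")        // hypothetical partition column
      .saveAsTable("events")

Note that coalesce(1) funnels the entire dataset through a single task, so the final write stage runs on one executor core; that serialization is the usual performance cost of forcing a single output file.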

From what I understand, you are trying to ensure that there are fewer files per partition. By using coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL") instead, where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive:

df.repartition($"COL")
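A runnable sketch of that suggestion, assuming a SparkSession spark, a DataFrame df, and a partition column COL; the table name is hypothetical:

    import spark.implicits._   // enables the $"COL" column syntax

    // repartition($"COL") shuffles rows so that all rows sharing a
    // COL value land in the same task, yielding one file per partition
    // value instead of many small files.
    df.repartition($"COL")
      .write
      .mode("overwrite")
      .partitionBy("COL")
      .saveAsTable("my_table") // hypothetical Hive table

Unlike coalesce, repartition performs a full shuffle, but it aligns Spark's task layout with the Hive partitioning, which is what keeps the per-partition file count down.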
