
How to coalesce large partitioned data into a single directory in Spark/Hive

Original post: 2018-01-23 16:20:45 | Tags: hadoop, apache-spark, dataframe, hive

I have a requirement: a huge dataset is partitioned and inserted into Hive. To consolidate the output files, I am using DF.coalesce(10). Now I want to consolidate this partitioned data into a single directory. If I use DF.coalesce(1), will performance decrease? Or is there another way to do this?

1 answer

From what I understand, you are trying to ensure that there are fewer files per partition. With coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL") instead, where COL is the column used to partition the data. This ensures that your "huge" data is split based on the partition column used in Hive: df.repartition($"COL")
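The suggestion above could be sketched roughly as follows. This is a minimal, self-contained illustration rather than the asker's actual job: the sample rows, the `value` column, the local master, the helper name `clusterByPartitionColumn`, and the output path `/tmp/coalesce_sketch_out` are all assumptions made for the example, and it writes Parquet files to a path instead of inserting into a real Hive table.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object CoalesceSketch {
  // Hypothetical helper: hash-shuffle the rows so that all rows sharing a
  // value of the partition column end up in the same Spark partition.
  def clusterByPartitionColumn(df: DataFrame, partitionCol: String): DataFrame =
    df.repartition(col(partitionCol))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use the cluster config.
    val spark = SparkSession.builder()
      .appName("coalesce-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the "huge" dataset; COL plays the role of the Hive partition column.
    val df = Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDF("COL", "value")

    // Each value of COL is now handled by a single write task, so each
    // COL=<value> directory typically receives a single data file.
    clusterByPartitionColumn(df, "COL")
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("COL")
      .parquet("/tmp/coalesce_sketch_out") // assumed output path

    spark.stop()
  }
}
```

By contrast, df.coalesce(1) funnels every row through a single task, so the write is no longer parallel, which is why it tends to be slow on large data. Repartitioning by the partition column keeps the write parallel across partition values while still limiting the number of files in each directory.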

Related questions

How to generate a large data set using hive / spark-sql?

In hive how to insert data into a single file

Create new hive table from existing external portioned table

Spark performance a large data-set save from Dataframe to hdfs or hive

Unable to access to Hive warehouse directory with Spark

How I can load csv data into hive using Spark dataframes?

How do I load data correctly in Hive using spark?

How are large directory trees processed in using the Spark API?

How much data is considered “too large” for a Hive MAPJOIN job?

Where to find large data for hive?





solution1 1 2018-01-23 16:44:04
