
Reading small HDFS partitions?

Our data is loaded into HDFS daily, partitioned by a date column. The issue is that each partition contains small files of less than 50 MB. So reading the data from all these partitions to load it into the next table takes hours. How can we address this issue?

I'd suggest running an end-of-day job to coalesce/combine the small files into significantly larger files before processing them in Spark.
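For example, here is a minimal sketch of such a compaction job; the paths, format, and target file count are assumptions for illustration, not from the question:

import org.apache.spark.sql.SparkSession

object CompactDailyPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-daily-partition")
      .getOrCreate()

    // Hypothetical locations; replace with your actual partition paths.
    val inputPath  = "hdfs:///data/events/date=2020-06-01"
    val outputPath = "hdfs:///data/events_compacted/date=2020-06-01"

    // Read the many small files of one daily partition and rewrite them
    // as a handful of larger files (coalesce avoids a full shuffle).
    spark.read
      .parquet(inputPath)
      .coalesce(4)                 // aim for a few large files instead of many <50 MB ones
      .write
      .mode("overwrite")
      .parquet(outputPath)

    spark.stop()
  }
}

You can schedule this per partition after the daily load, then point downstream jobs at the compacted location.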

For further reading, the Cloudera blog/docs on Partition Management in Hadoop discuss several techniques to address these problems, such as:

    1. Merging partitions on selected tables
    2. Archiving cold data
    3. Deleting partitions

Select the technique from the Cloudera blog that matches your requirements. Hope this helps!
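As an illustration of the "deleting partitions" technique, old partitions of a Hive-backed table can be dropped from Spark SQL; the table name and partition spec below are hypothetical:

// Assumes `spark` is an existing SparkSession with Hive support enabled.
// Drop a cold partition that is no longer needed downstream.
spark.sql("ALTER TABLE events DROP IF EXISTS PARTITION (date='2019-01-01')")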


Another good option: a typical use case is the open source Delta Lake, or if you are using Databricks, go for their Delta Lake offering to get a richer set of features.

Example Maven coordinates:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>0.6.1</version>
</dependency>

Using Delta Lake you can insert/update/delete the data as you want, which reduces the maintenance steps.
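For instance, here is a minimal sketch of deletes and updates with the DeltaTable Scala API; the path, column names, and values are assumptions for illustration:

import io.delta.tables.DeltaTable

// Hypothetical Delta table location and columns; `spark` is an existing SparkSession.
val deltaTable = DeltaTable.forPath(spark, "hdfs:///data/events_delta")

// Delete rows older than a cutoff date.
deltaTable.delete("date < '2020-01-01'")

// Update a column for rows matching a condition (values are SQL expressions).
deltaTable.updateExpr(
  "status = 'PENDING'",
  Map("status" -> "'PROCESSED'"))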

Compacting Small Files in Delta Lakes
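A rough sketch of the usual compaction pattern for a Delta table, rewriting many small files into a few larger ones without logically changing the data; the path and target file count are assumptions:

// Assumes `spark` is an existing SparkSession with delta-core on the classpath.
val path = "hdfs:///data/events_delta"
val numFiles = 8

spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)           // shuffle into a small number of larger files
  .write
  .option("dataChange", "false")   // mark the rewrite as compaction, not new data
  .format("delta")
  .mode("overwrite")
  .save(path)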
