
Spark on Cluster: Read Large number of small avro files is taking too long to list

I know that reading a large number of small files in HDFS has always been an issue and has been widely discussed, but bear with me. Most of the Stack Overflow questions dealing with this type of issue concern reading a large number of txt files; I'm trying to read a large number of small avro files.

Plus, the solutions for reading txt files talk about using WholeTextFileInputFormat or CombineInputFormat ( https://stackoverflow.com/a/43898733/11013878 ), which are RDD implementations. I'm using Spark 2.4 (HDFS 3.0.0), where RDD implementations are generally discouraged and dataframes are preferred. I would prefer using dataframes but am open to RDD implementations as well.

I've tried unioning dataframes as suggested by Murtaza, but on a large number of files I get an OOM error ( https://stackoverflow.com/a/32117661/11013878 ).

I'm using the following code:

val filePaths = avroConsolidator.getFilesInDateRangeWithExtension // filePaths: Array[String]
// I need to build the list of file paths myself because I filter files based on file names;
// this logic is required by an upstream process. Example:
// Array("hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1530.avro",
//       "hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1531.avro",
//       "hdfs://server123:8020/source/Avro/weblog/2019/06/03/20190603_1532.avro")
val df_mid = sc.read.format("com.databricks.spark.avro").load(filePaths: _*)
val df = df_mid
  .withColumn("dt", date_format(df_mid.col("timeStamp"), "yyyy-MM-dd"))
  .filter("dt != 'null'")

df
  .repartition(partitionColumns(inputs.logSubType).map(new org.apache.spark.sql.Column(_)): _*)
  .write.partitionBy(partitionColumns(inputs.logSubType): _*)
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(avroConsolidator.parquetFilePath.toString)

It took 1.6 mins to list 183 small files at the job level (screenshot: Job UI).

Weirdly enough, my Stage UI page just shows 3s, and I don't understand why (screenshot: Stage UI).

The avro files are stored in yyyy/mm/dd partitions: hdfs://server123:8020/source/Avro/weblog/2019/06/03

Is there any way I can speed up the listing of leaf files? As you can see from the screenshot, it takes only 6s to consolidate into parquet files, but 1.3 mins to list the files.
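
For context while reading the answers below: Spark 2.4 has settings that control whether this leaf-file listing runs serially on the driver or as a distributed Spark job. The sketch below only shows where those knobs live; the values are illustrative, not tuned recommendations, and the application name is a placeholder.

import org.apache.spark.sql.SparkSession

// Sketch only: standard Spark SQL settings that govern leaf-file listing.
val spark = SparkSession.builder()
  .appName("AvroConsolidation") // placeholder name
  // If more paths than this are detected, Spark lists files with a distributed job
  // instead of serially on the driver (default is 32).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Caps the parallelism of that distributed listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "64")
  .getOrCreate()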

Since it was taking too long to read a large number of small files, I took a step back and created RDDs using CombineFileInputFormat. This InputFormat works well with small files, because it packs many of them into one split, so there are fewer mappers and each mapper has more data to process.

Here's what I did:

def createDataFrame(filePaths: Array[Path], sc: SparkSession, inputs: AvroConsolidatorInputs): DataFrame = {

   val job: Job = Job.getInstance(sc.sparkContext.hadoopConfiguration)
   FileInputFormat.setInputPaths(job, filePaths: _*)
   val sqlType = SchemaConverters.toSqlType(getSchema(inputs.logSubType))

   val rddKV = sc.sparkContext.newAPIHadoopRDD(
                   job.getConfiguration,
                   classOf[CombinedAvroKeyInputFormat[GenericRecord]],
                   classOf[AvroKey[GenericRecord]],
                   classOf[NullWritable])

   val rowRDD = rddKV.mapPartitions(
                  f = (iter: Iterator[(AvroKey[GenericRecord], NullWritable)]) =>
                       iter.map(_._1.datum()).map(genericRecordToRow(_, sqlType)),
                  preservesPartitioning = true)

   val df = sc.sqlContext.createDataFrame(rowRDD,
              sqlType.dataType.asInstanceOf[StructType])
   df
}

CombinedAvroKeyInputFormat is a user-defined class which extends CombineFileInputFormat and puts 64MB of data in a single split.

object CombinedAvroKeyInputFormat {

  class CombinedAvroKeyRecordReader[T](var inputSplit: CombineFileSplit, context: TaskAttemptContext, idx: Integer)
    extends AvroKeyRecordReader[T](AvroJob.getInputKeySchema(context.getConfiguration))
  {
    @throws[IOException]
    @throws[InterruptedException]
    override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
      this.inputSplit = inputSplit.asInstanceOf[CombineFileSplit]
      val fileSplit = new FileSplit(this.inputSplit.getPath(idx),
                                    this.inputSplit.getOffset(idx),
                                    this.inputSplit.getLength(idx),
                                    this.inputSplit.getLocations)
      super.initialize(fileSplit, context)
    }
  }

}

/*
 * The class CombineFileInputFormat is an abstract class with no implementation, so we must create a subclass to support it;
 * We’ll name the subclass CombinedAvroKeyInputFormat. The subclass will initiate a delegate CombinedAvroKeyRecordReader that extends AvroKeyRecordReader
 */

class CombinedAvroKeyInputFormat[T] extends CombineFileInputFormat[AvroKey[T], NullWritable] {
  val logger = Logger.getLogger(AvroConsolidator.getClass)
  setMaxSplitSize(67108864)
  def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[AvroKey[T], NullWritable] = {
    val c = classOf[CombinedAvroKeyInputFormat.CombinedAvroKeyRecordReader[_]]
    val inputSplit = split.asInstanceOf[CombineFileSplit]

    /*
     * CombineFileRecordReader is a built-in class that passes each split to our CombinedAvroKeyRecordReader.
     * When the Hadoop job starts, CombineFileRecordReader reads the sizes of all the files in HDFS that we
     * want it to process, and decides how many splits to create based on the MaxSplitSize.
     */
    return new CombineFileRecordReader[AvroKey[T], NullWritable](
      inputSplit,
      context,
      c.asInstanceOf[Class[_ <: RecordReader[AvroKey[T], NullWritable]]])
  }
}
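
To tie the pieces together, here is a minimal sketch of the calling side, assuming the helpers from the snippets above (getSchema, createDataFrame, avroConsolidator, inputs) are in scope. One caveat: CombinedAvroKeyRecordReader pulls the reader schema out of the job configuration via AvroJob.getInputKeySchema, so the schema needs to be registered on the job with AvroJob.setInputKeySchema before newAPIHadoopRDD is called.

import org.apache.avro.mapreduce.AvroJob
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sketch only: names not defined here come from the snippets above.
val spark = SparkSession.builder().appName("AvroConsolidation").getOrCreate()

// Build Hadoop Paths from the same filtered file list as before.
val paths: Array[Path] =
  avroConsolidator.getFilesInDateRangeWithExtension.map(new Path(_))

// Inside createDataFrame, right after Job.getInstance, register the reader schema so that
// AvroJob.getInputKeySchema(context.getConfiguration) can find it in the record reader:
//   AvroJob.setInputKeySchema(job, getSchema(inputs.logSubType))

val df = createDataFrame(paths, spark, inputs)
df.printSchema()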

This made reading of small files a lot faster.

I had a similar issue reading 100s of small avro files from AWS S3 with:

spark.read.format("avro").load(<file_directory_path_containing_many_avro_files>)

The job would hang at various points after completing most of the scheduled tasks. For example, one time it quickly completed 110 tasks out of 111 in 25 seconds and hung at task 110, and on the next try it hung at task 98 out of 111. It did not progress past the hang point.

After reading about similar issues here: https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/

which references the Spark configuration guide here:

Spark configuration guide

Although it does not address the original cause of the hang, the Spark configuration below proved to be a quick fix and workaround.

Setting spark.speculation to true solved the issue.
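
For completeness, a minimal sketch of how that flag can be set from code (equivalent to passing --conf spark.speculation=true to spark-submit); the application name is a placeholder.

import org.apache.spark.sql.SparkSession

// Sketch only: enable speculative execution so straggler tasks get re-launched
// on other executors.
val spark = SparkSession.builder()
  .appName("SmallAvroFilesFromS3")      // placeholder name
  .config("spark.speculation", "true")  // re-schedule slow-running tasks
  .getOrCreate()

val df = spark.read.format("avro").load("<file_directory_path_containing_many_avro_files>")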
