Scala - How to merge incremental files of HDFS location

My requirement is that I have multiple HDFS locations, each ingesting files from Kafka every hour. For each directory, how do I merge all files from a particular timestamp up to the current timestamp into a single parquet file, and on the next run merge only the files from the last merged timestamp up to the current timestamp, repeating the same in the future? All of this has to be done in a Spark Scala job, so I can't use a normal shell script. Any suggestions are appreciated.

Here is a code snippet that should help get this done.

The first step is to get the list of files per date as a Map (Map[String, List[String]]), where the key is the date and the value is the list of files with that date. The date is taken from the modification timestamp of the HDFS file.

Note: The code was tested using a local path; supply the right HDFS path/URL as required.
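
For an actual HDFS location, the FileSystem handle can be obtained for the cluster URI explicitly instead of the default one. A minimal sketch, assuming the namenode host/port and directory are placeholders (they are not taken from the original listing):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Placeholders: replace host, port and directory with your cluster's values.
val hadoopConf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), hadoopConf)
val inputDir = "hdfs://namenode:8020/data/kafka_ingest"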

While writing the output, there is no direct option to specify the target filename, but you can specify a target directory per date. The code uses the FileSystem API to rename the output file to the desired name and to delete the temporary output folder created for each date.

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.joda.time.format.DateTimeFormat


object MergeFiles {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Merging files day wise in a directory")
      .master("local[2]")
      .getOrCreate()

    val inputDir = "/Users/sujesh/test_data"
    val outputDir = "/Users/sujesh/output_data"

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    val fs = FileSystem.get(hadoopConf)

    val filesPerDate = getFiles(inputDir, fs)

    // For each date, read that date's files, coalesce them into a single
    // partition and write them to a temporary per-date output directory.
    filesPerDate
      .foreach { m =>
        spark
          .read
          .format("csv")
          .option("inferSchema", false)
          .option("header", false)
          .load(m._2:_*)
          .repartition(1)
          .write
          .format("csv")
          .save(s"$outputDir/${m._1}")

        // Rename the single part file to <date>.csv and remove the temp directory.
        val file = fs.globStatus(new Path(s"$outputDir/${m._1}/part*.csv"))(0).getPath.getName
        fs.rename(new Path(s"$outputDir/${m._1}/$file"), new Path(s"$outputDir/${m._1}.csv"))
        fs.delete(new Path(s"$outputDir/${m._1}"), true)
      }
  }

  /*
    Get the list of files grouped by date;
    the date is taken from the file's modification timestamp.
   */
  def getFiles(dir: String, fs: FileSystem) = {
    fs
      .globStatus(new Path(s"$dir/*.csv"))
      .map { f: FileStatus =>
        (DateTimeFormat.forPattern("yyyyMMdd").print(f.getModificationTime), f.getPath.toUri.getRawPath)
       }.groupBy(_._1)
       .map { case (k,v) => (k -> v.map(_._2).toSeq) }
  }
}
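
The original question asks for a single parquet file per merge, while the listing above writes csv. Below is a hedged sketch of the lines that would change inside the same foreach for parquet output; the value name mergedDf is introduced here for illustration and is not part of the listing above.

        // Inside the same foreach as above: write parquet instead of csv.
        val mergedDf = spark.read
          .format("csv")
          .option("inferSchema", false)
          .option("header", false)
          .load(m._2: _*)

        mergedDf.repartition(1)
          .write
          .format("parquet")
          .save(s"$outputDir/${m._1}")

        // Rename the single part file to <date>.parquet and drop the temp directory.
        val file = fs.globStatus(new Path(s"$outputDir/${m._1}/part*.parquet"))(0).getPath.getName
        fs.rename(new Path(s"$outputDir/${m._1}/$file"), new Path(s"$outputDir/${m._1}.parquet"))
        fs.delete(new Path(s"$outputDir/${m._1}"), true)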

You may further optimise the code after testing, and convert the file-rename code into a utility if it has to be re-used, as sketched below. All options such as inferSchema and header are set to false; adjust them as you need. This approach should work for other file formats too.
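
For example, the rename-and-cleanup steps could be extracted into a small helper like the sketch below; the method name and signature are illustrative, reusing only the FileSystem API calls already shown above.

  // Moves the single part file produced by repartition(1) out of its temp
  // directory to the target path, then deletes the temp directory.
  def renameAndCleanup(fs: FileSystem, tempDir: String, targetFile: String): Unit = {
    val part = fs.globStatus(new Path(s"$tempDir/part*"))(0).getPath
    fs.rename(part, new Path(targetFile))
    fs.delete(new Path(tempDir), true)
  }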

Note: If you repeatedly run this process on the same directory, further tweaks are required, because the newly created files will have the latest modification timestamp. So if this is not run daily, you also need to explicitly update the modification timestamp of the merged file, or ignore files whose names match a pattern such as yyyyMMdd.csv.
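
For instance, a variant of getFiles could skip already-merged outputs and anything older than the last merge. This is only a sketch under those assumptions: the lastMergedTs parameter and the yyyyMMdd.csv name filter are illustrative, and the imports are the same as in the listing above.

  // Same grouping as getFiles, but ignores files modified at or before the last
  // merge and files whose name already matches the merged-output pattern yyyyMMdd.csv.
  def getNewFiles(dir: String, fs: FileSystem, lastMergedTs: Long): Map[String, Seq[String]] = {
    fs
      .globStatus(new Path(s"$dir/*.csv"))
      .filter(f => f.getModificationTime > lastMergedTs &&
                   !f.getPath.getName.matches("""\d{8}\.csv"""))
      .map(f => (DateTimeFormat.forPattern("yyyyMMdd").print(f.getModificationTime),
                 f.getPath.toUri.getRawPath))
      .groupBy(_._1)
      .map { case (k, v) => k -> v.map(_._2).toSeq }
  }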
