
How to Read Files in Flink FlatMapFunction

I am building a Flink pipeline and, based on live input data, need to read records from archive files in a RichFlatMapFunction (e.g. each day I want to read files from the previous day and week). I'm wondering what the best way to do that is.

I could use the Hadoop APIs directly, so that is what I'm trying next.

That would be something like this:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

class LoadHistory(
  var basePath: String,
  var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {

  // see
  // https://programmerall.com/article/34422316834/
  // https://stackoverflow.com/questions/37085528/hadoop-with-binary-files
  // https://data-flair.training/blogs/hdfs-data-read-operation

  // FileSystem is not serializable, so initialize it lazily on the task
  // manager rather than eagerly on the client that submits the job.
  @transient private lazy val fileSystem = FileSystem.get(new Configuration())

  def formatPath(pathTemplate: String, gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???

  override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
    val pathStr = formatPath(pathTemplate, value._1, value._2)
    val path = new Path(pathStr)

    // Skip silently if the archive for this grid/date does not exist.
    if (!fileSystem.exists(path)) {
      return
    }

    val in: FSDataInputStream = fileSystem.open(path)
    if (pathStr.endsWith(".protobuf")) {
      // TODO read file
    } else {
      assert(pathStr.endsWith(".lz4"))
      // TODO read file
    }
  }
}

I'm new to Hadoop, so I figure I'll need to configure it before reading data from cloud storage (e.g. replace new Configuration() with something meaningful). I know Flink uses Hadoop internally to read files, so I am wondering whether I can access the configuration or the configured HadoopFileSystem object that Flink uses at runtime.
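For reference, here is a minimal sketch of what configuring Hadoop by hand might look like, assuming the s3a connector (hadoop-aws) is on the classpath. The bucket name and the way credentials are sourced below are illustrative assumptions, not part of my actual setup:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
// Illustrative s3a settings; in practice these usually come from
// core-site.xml or the environment rather than being set in code.
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Bind the FileSystem to the bucket's URI scheme and authority.
val fileSystem = FileSystem.get(new URI("s3a://my-archive-bucket/"), hadoopConf)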

Previously I tried starting a Flink batch job inside the FlatMapFunction (ending with env.collect), but it seems to have resulted in thread-locking (job 2 won't start until job 1 is done).

I dug into the Flink source code a little and found a way to get an initialized org.apache.flink.core.fs.FileSystem object from an org.apache.flink.core.fs.Path. That can then be used to read the files:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.core.fs.{FSDataInputStream, FileSystem, Path}
import org.apache.flink.util.Collector

class LoadHistory(
  var basePath: String,
  var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {

  // Path.getFileSystem() returns the FileSystem Flink has already
  // configured for this scheme (s3://, hdfs://, file://, ...).
  // It is not serializable, so initialize it lazily on the task manager.
  @transient private lazy val fileSystem = new Path(basePath).getFileSystem()

  def formatPath(gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???

  override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
    val pathStr = formatPath(value._1, value._2)
    val path = new Path(pathStr)

    if (!fileSystem.exists(path)) {
      return
    }

    val in: FSDataInputStream = fileSystem.open(path)

    if (pathStr.endsWith(".protobuf")) {
      // TODO read file
    } else {
      assert(pathStr.endsWith(".lz4"))
      // TODO read file
    }
  }
}
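To fill in the TODOs, something like the following could work. This is a sketch under two assumptions about the archive format that the question doesn't confirm: that ArchiveRecord is a Java protobuf-generated class (so parseDelimitedFrom returns null at end of stream) and that the .lz4 files are LZ4-block-compressed streams of the same records, readable via lz4-java's LZ4BlockInputStream:

import java.io.InputStream
import net.jpountz.lz4.LZ4BlockInputStream
import org.apache.flink.util.Collector

// Hypothetical sketch: reads length-delimited protobuf ArchiveRecord
// messages from a raw or LZ4-block-compressed stream.
def readRecords(in: InputStream, out: Collector[ArchiveRecord]): Unit = {
  var record = ArchiveRecord.parseDelimitedFrom(in) // null at end of stream
  while (record != null) {
    out.collect(record)
    record = ArchiveRecord.parseDelimitedFrom(in)
  }
}

// Inside flatMap, replacing the TODOs:
// try {
//   if (pathStr.endsWith(".lz4")) readRecords(new LZ4BlockInputStream(in), out)
//   else readRecords(in, out)
// } finally {
//   in.close() // always release the stream, even on a parse failure
// }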
