
How to Read Files in Flink FlatMapFunction

I am building a Flink pipeline and, based on live input data, need to read records from archive files in a RichFlatMapFunction (e.g. each day I want to read files from the previous day and week). I'm wondering what the best way to do that is.

I could use the Hadoop APIs directly, so that is what I'm trying next.

That would be something like this:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

class LoadHistory(
  var basePath: String,
  var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {

  // see
  // https://programmerall.com/article/34422316834/
  // https://stackoverflow.com/questions/37085528/hadoop-with-binary-files
  // https://data-flair.training/blogs/hdfs-data-read-operation

  // FileSystem is not serializable, so initialize it lazily on the task
  // manager rather than eagerly on the client that submits the job.
  @transient private lazy val fileSystem = FileSystem.get(new Configuration())

  def formatPath(pathTemplate: String, gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???

  override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
    val pathStr = formatPath(pathTemplate, value._1, value._2)
    val path = new Path(pathStr)

    // Skip silently if the archive for this grid/date does not exist.
    if (!fileSystem.exists(path)) {
      return
    }

    val in: FSDataInputStream = fileSystem.open(path)
    if (pathStr.endsWith(".protobuf")) {
      // TODO read file
    } else {
      assert(pathStr.endsWith(".lz4"))
      // TODO read file
    }
  }
}

I'm new to Hadoop, so I figure I'll need to configure it before reading data from cloud storage (e.g. replace new Configuration() with something meaningful). I know Flink uses Hadoop internally to read files, so I am wondering whether I can access the configuration or the configured HadoopFileSystem object that Flink uses at runtime.
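For reference, here is a minimal sketch of what configuring Hadoop by hand might look like, assuming the s3a connector (hadoop-aws) is on the classpath. The bucket name and the way credentials are sourced below are illustrative assumptions, not part of my actual setup:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
// Illustrative s3a settings; in practice these usually come from
// core-site.xml or the environment rather than being set in code.
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Bind the FileSystem to the bucket's URI scheme and authority.
val fileSystem = FileSystem.get(new URI("s3a://my-archive-bucket/"), hadoopConf)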

Previously I tried starting a Flink batch job inside the FlatMapFunction (ending with env.collect), but it seems to have resulted in thread-locking (job 2 won't start until job 1 is done).

I dug into the Flink source code a little and found a way to get an initialized org.apache.flink.core.fs.FileSystem object from an org.apache.flink.core.fs.Path. That can then be used to read the files:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.core.fs.{FSDataInputStream, FileSystem, Path}
import org.apache.flink.util.Collector

class LoadHistory(
  var basePath: String,
  var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {

  // Path.getFileSystem() returns the FileSystem Flink has already
  // configured for this scheme (s3://, hdfs://, file://, ...).
  // It is not serializable, so initialize it lazily on the task manager.
  @transient private lazy val fileSystem = new Path(basePath).getFileSystem()

  def formatPath(gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???

  override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
    val pathStr = formatPath(value._1, value._2)
    val path = new Path(pathStr)

    if (!fileSystem.exists(path)) {
      return
    }

    val in: FSDataInputStream = fileSystem.open(path)

    if (pathStr.endsWith(".protobuf")) {
      // TODO read file
    } else {
      assert(pathStr.endsWith(".lz4"))
      // TODO read file
    }
  }
}
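To fill in the TODOs, something like the following could work. This is a sketch under two assumptions about the archive format that the question doesn't confirm: that ArchiveRecord is a Java protobuf-generated class (so parseDelimitedFrom returns null at end of stream) and that the .lz4 files are LZ4-block-compressed streams of the same records, readable via lz4-java's LZ4BlockInputStream:

import java.io.InputStream
import net.jpountz.lz4.LZ4BlockInputStream
import org.apache.flink.util.Collector

// Hypothetical sketch: reads length-delimited protobuf ArchiveRecord
// messages from a raw or LZ4-block-compressed stream.
def readRecords(in: InputStream, out: Collector[ArchiveRecord]): Unit = {
  var record = ArchiveRecord.parseDelimitedFrom(in) // null at end of stream
  while (record != null) {
    out.collect(record)
    record = ArchiveRecord.parseDelimitedFrom(in)
  }
}

// Inside flatMap, replacing the TODOs:
// try {
//   if (pathStr.endsWith(".lz4")) readRecords(new LZ4BlockInputStream(in), out)
//   else readRecords(in, out)
// } finally {
//   in.close() // always release the stream, even on a parse failure
// }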
