
Get file name of DataStream with Flink

I have a Flink streaming job that reads CSV files from a single path. I want to know the file name of each processed file.

I currently read the CSV files from the path (dataPath) like this:

val recs:DataStream[CallCenterEvent] = env
          .readFile[CallCenterEvent](
          CsvReader.getReaderFormat[CallCenterEvent](dataPath, c._2),
          dataPath,
          FileProcessingMode.PROCESS_CONTINUOUSLY,
          c._2.fileInterval)
          .uid("source-%s-%s".format(systemConfig.name, c._1))
          .name("%s records reading".format(c._1))

And I use this function to obtain the TupleCsvInputFormat:

def getReaderFormat[T <: Product : ClassTag : TypeInformation](dataPath:String, conf:URMConfiguration): TupleCsvInputFormat[T] = {
  val typeInfo = implicitly[TypeInformation[T]]
  val format: TupleCsvInputFormat[T] = new TupleCsvInputFormat[T](new Path(dataPath), typeInfo.asInstanceOf[CaseClassTypeInfo[T]])
  if (conf.quoteCharacter != null && !conf.quoteCharacter.equals(""))
    format.enableQuotedStringParsing(conf.quoteCharacter.charAt(0))
  format.setFieldDelimiter(conf.fieldDelimiter)
  format.setSkipFirstLineAsHeader(conf.ignoreFirstLine)
  format.setLenient(true)

  format
}

The process runs fine, but I can't find a way to get the file name of each CSV file processed.

Thanks in advance.

I faced a similar situation where I needed to know the file name of the record being processed. The file name contains information that is not available inside the record, and asking the customer to change the record schema is not an option.

I found a way to get access to the underlying source. In my case it is the FileInputSplit, which holds the Path of the source data file.

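import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path

class MyTextInputFormat(p: Path) extends TextInputFormat(p) {

  override def readRecord(reusable: String, bytes: Array[Byte], offset: Int, numBytes: Int): String = {
    // currentSplit is the FileInputSplit currently being read; fall back to a placeholder if it is not set
    val fileName =
      if (this.currentSplit != null) this.currentSplit.getPath.getName
      else "unknown-file-path"

    // Append the file name to the record as an extra comma-separated field
    super.readRecord(reusable, bytes, offset, numBytes) + "," + fileName
  }
}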

Now you can use this in the stream setup:

val format = new MyTextInputFormat(new Path(srcDir))
format.setDelimiter(prfl.lineSep)
val stream = env.readFile(format, srcDir, FileProcessingMode.PROCESS_CONTINUOUSLY, Time.seconds(10).toMilliseconds)
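
Each element of the stream is then the original line with the file name appended as the last comma-separated field. A minimal sketch of how that field could be separated again downstream is shown here. FileNameTag and its split method are hypothetical names used only for illustration, not part of Flink, and the sketch assumes the file name itself contains no comma.

```scala
// Hypothetical helper, not part of Flink or of the input format above.
object FileNameTag {

  /** Splits "payload,fileName" at the LAST comma, because the CSV payload may itself contain commas.
    * Assumption: the file name does not contain a comma.
    */
  def split(tagged: String): (String, String) = {
    val idx = tagged.lastIndexOf(',')
    if (idx < 0) (tagged, "unknown-file-path") // no tag found; reuse the placeholder name
    else (tagged.substring(0, idx), tagged.substring(idx + 1))
  }
}
```

With the Scala DataStream API this could be applied as stream.map(line => FileNameTag.split(line)), giving a stream of (record, fileName) pairs. Splitting at the last comma keeps the original fields intact for whatever CSV parsing follows.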

While my situation is a little different, this approach should help you as well!
