
Google Dataflow job to read Avro files from Cloud Storage based on file patterns

Say the files in GCS are stored in the following format: -.avro. I am trying to read files in a Google Dataflow job using Apache Beam's FileIO.matchAll to match files based on a timestamp interval. Example files in GCS:

    gs://test-bucket/abc_20200101000000.txt
    gs://test-bucket/abc_20200201000000.txt
    gs://test-bucket/abc_20200301000000.txt

Now I want to fetch all files whose timestamp is greater than 20200101000000, up to the current timestamp. What file pattern can I use?

I don't know if you can do this with a regular expression, but you should be able to add a ParDo to your pipeline, downstream of FileIO.matchAll, to filter the elements (of type MatchResult.Metadata) based on the file name (MatchResult.Metadata.resourceId()).
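A minimal sketch of that filtering idea, assuming the 14-digit timestamp (yyyyMMddHHmmss) sits right before the file extension as in the examples above. The `is_after` helper and `TS_PATTERN` names are illustrative, not part of any Beam API; in a real pipeline this predicate would run inside the ParDo (Java) or a `beam.Filter` after `fileio.MatchFiles` (Python SDK) over the matched-file metadata:

```python
import re

# Extract the 14-digit timestamp from a path like
# gs://test-bucket/abc_20200101000000.txt
TS_PATTERN = re.compile(r"_(\d{14})\.\w+$")

def is_after(path: str, start_ts: str) -> bool:
    """Return True if the file's embedded timestamp is strictly greater than start_ts."""
    m = TS_PATTERN.search(path)
    # Plain string comparison is correct here because the
    # timestamps are fixed-width and zero-padded.
    return bool(m) and m.group(1) > start_ts

files = [
    "gs://test-bucket/abc_20200101000000.txt",
    "gs://test-bucket/abc_20200201000000.txt",
    "gs://test-bucket/abc_20200301000000.txt",
]
selected = [f for f in files if is_after(f, "20200101000000")]
# selected -> the two files dated after 2020-01-01
```

In an actual Beam pipeline (Python SDK) the same predicate could be applied roughly as `fileio.MatchFiles("gs://test-bucket/abc_*") | beam.Filter(lambda md: is_after(md.path, start_ts))`; no upper bound on the timestamp is needed, since files dated up to "now" are simply all the files that exist.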

