
Apache Beam streaming pipeline to watch GCS file regex

I have a streaming Beam pipeline in which I monitor multiple glob/regex patterns. Some of those patterns already match existing files, and others will only match files generated in the future.

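// filePatterns is the collection of glob strings to watch.
PCollection<String> fileGlobs = p.apply(Create.of(filePatterns));

// Poll each pattern every 10 seconds (Duration is org.joda.time.Duration);
// stop watching a pattern once it has produced no new file for 1 hour.
PCollection<Metadata> f = fileGlobs.apply("MatchALL",
    FileIO.matchAll().continuously(
        Duration.standardSeconds(10),
        Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));

// f = .. some more transformations and then write to GCS ..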

The expected behaviour is to match the existing files against the provided patterns and to keep watching for new files matching those patterns being written to GCS. The termination condition I enforce is: stop matching a pattern if the last file that matched it was generated more than an hour ago. The observed behaviour is that a lot of files are matched, but the transforms downstream of the unbounded f are never executed. The logs just show:

polling returned 681384 results, of which 681384 were new. The output is incomplete.

I give two different patterns to watch. The first already matched ~500k existing files, with more being added every minute; for it I never saw any output, only the log line above. The second matched 0 files when the pipeline started, but as soon as newly arriving files began to match it, the corresponding output files were written to GCS.

Can someone explain this behaviour and tell me whether I am using continuous matching correctly? I don't create any windows here because my use case is pretty simple: stream files -> read files -> filter some events -> write those files back to a GCS bucket.

This is a bug in Splittable DoFn that affects the Watch transform when a single round of polling takes more than 10 seconds, which happens when watching a filepattern that matches a very large number of files. The bug causes no output to be produced: the transform gets checkpointed before it makes any progress, so when it resumes from the checkpoint it is "back to square 1" in a sense.

Please follow JIRA for updates and a suggested workaround.
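
To make the failure mode above more concrete, here is a toy model in plain Java. It does not use Beam and does not reproduce the real Splittable DoFn code. The class name, the method `filesEmitted`, and all of its parameters are made up for illustration, and treating the 10 seconds as a fixed per-attempt budget is a simplifying assumption. The only point is to show why an attempt that is always cut off before it records progress never emits anything, no matter how many times it is retried.

```java
/** Toy model only; hypothetical names, not Beam's actual implementation. */
final class WatchCheckpointToyModel {

    /**
     * Simulates repeated polling attempts for one file pattern.
     *
     * @param pollSeconds       time one full polling round needs (assumed constant)
     * @param checkpointSeconds time after which an attempt is checkpointed (assumed budget)
     * @param matchingFiles     number of files a completed round would report
     * @param attempts          how many times the work is resumed
     * @return number of files emitted downstream
     */
    static int filesEmitted(int pollSeconds, int checkpointSeconds,
                            int matchingFiles, int attempts) {
        for (int attempt = 1; attempt <= attempts; attempt++) {
            // Each resume starts from zero: the checkpoint recorded no progress.
            int progress = 0;
            while (progress < pollSeconds && progress < checkpointSeconds) {
                progress++;
            }
            if (progress == pollSeconds) {
                // The round finished within the budget, so results are emitted.
                return matchingFiles;
            }
            // Otherwise the attempt was cut off and its work is discarded.
        }
        return 0;
    }

    public static void main(String[] args) {
        // Large pattern: a round is slower than the budget.
        System.out.println(filesEmitted(30, 10, 681384, 100));
        // Small pattern: a round fits inside the budget.
        System.out.println(filesEmitted(2, 10, 5, 1));
    }
}
```

In this model the slow pattern emits nothing regardless of the number of attempts, while the fast one emits on its first attempt. That mirrors the question, where the pattern with ~500k files produced no output and the initially empty pattern worked.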
