
Best way to trigger and window a Python Beam process that reads from Google Storage on Dataflow

I have never used Beam before, and the whole trigger and window business confuses me.

I need to write a program that runs on Dataflow and reads from Google Storage under a path like this: node-<num>/<table_name>/<timestamp>/file (I have multiple nodes, the table names are the same on every node, and there is one file per timestamp). Files are being uploaded there continually. (I would love to avoid using Pub/Sub, since I work for a small company and it costs more money...)

Now, since there are multiple nodes, there could be duplicates among the files, so I want to group them by timestamp, and from what I've read I need to take that into account in the windowing.

So how should I trigger and window this so that it runs "forever", with a way for me to group the files by timestamp and remove duplicates?

Thanks a lot!

As documented in File processing patterns, continuous read mode is not supported in Python.

You need to use the Java SDK, where you can manually assign a timestamp to each matched file name; a sketch follows below.
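Since you haven't used Beam before, here is a minimal sketch of what that can look like in the Java SDK: FileIO.match().continuously(...) polls the bucket forever, the <timestamp> path segment becomes each element's event time, and Distinct drops duplicate (table, timestamp) pairs within fixed windows. The bucket name gs://my-bucket, the one-minute poll interval, the five-minute window, and the two path-parsing helpers are all assumptions for illustration, not part of your setup.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class ContinuousGcsRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("MatchContinuously",
            FileIO.match()
                .filepattern("gs://my-bucket/node-*/*/*/file")
                // Poll GCS every minute and never stop watching, so the
                // pipeline runs as an unbounded (streaming) job.
                .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
        .apply("ToFileName",
            MapElements.into(TypeDescriptors.strings())
                .via((MatchResult.Metadata m) -> m.resourceId().toString()))
        // Use the <timestamp> path segment as the element's event time.
        // Files land after the time encoded in their path, so some
        // timestamp skew has to be allowed (the exact bound is a guess).
        .apply("AssignEventTime",
            WithTimestamps.<String>of(ContinuousGcsRead::timestampFromPath)
                .withAllowedTimestampSkew(Duration.standardDays(1)))
        .apply("Window",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        // Several nodes upload the same table/timestamp, so within each
        // window keep only one file per (table_name, timestamp) pair.
        .apply("Dedup",
            Distinct.withRepresentativeValueFn(
                    (String path) -> tableAndTimestamp(path))
                .withRepresentativeType(TypeDescriptors.strings()));

    p.run();
  }

  // Hypothetical parser: assumes the <timestamp> segment is epoch millis.
  static Instant timestampFromPath(String path) {
    String[] parts = path.split("/");
    return new Instant(Long.parseLong(parts[parts.length - 2]));
  }

  // Hypothetical dedup key: "<table_name>/<timestamp>" identifies a file
  // regardless of which node uploaded it.
  static String tableAndTimestamp(String path) {
    String[] parts = path.split("/");
    return parts[parts.length - 3] + "/" + parts[parts.length - 2];
  }
}
```

Note that Distinct deduplicates per window, which is why the event time is taken from the path timestamp rather than the upload time: copies of the same logical file from different nodes then fall into the same window and collapse to one element.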
