
Best way to trigger and window a Python Beam process that reads from Google Storage on Dataflow

I have never used Beam before, and the whole trigger and window business confuses me.

I need to write a program that runs on Dataflow and reads from Google Storage under a path like this: node-<num>/<table_name>/<timestamp>/file (I have multiple nodes, the table names are the same on every node, and there is one file per timestamp). Files are being uploaded there continually. (I would love to avoid using Pub/Sub, since I work for a small company and it costs more money...)

Now, since there are multiple nodes, there could be duplicates among the files, so I want to group them by timestamp, and from what I've read I need to take that into account in the windowing.

So how should I trigger and window this so that it runs "forever", with a way for me to group the files by timestamp and remove duplicates?

Thanks a lot!

As documented in File processing patterns, continuous read mode is not supported in Python.

You need to use the Java SDK, where you can manually assign a timestamp to each matched file name; a sketch follows below.
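Since you haven't used Beam before, here is a minimal sketch of what that can look like in the Java SDK: FileIO.match().continuously(...) polls the bucket forever, the <timestamp> path segment becomes each element's event time, and Distinct drops duplicate (table, timestamp) pairs within fixed windows. The bucket name gs://my-bucket, the one-minute poll interval, the five-minute window, and the two path-parsing helpers are all assumptions for illustration, not part of your setup.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class ContinuousGcsRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("MatchContinuously",
            FileIO.match()
                .filepattern("gs://my-bucket/node-*/*/*/file")
                // Poll GCS every minute and never stop watching, so the
                // pipeline runs as an unbounded (streaming) job.
                .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
        .apply("ToFileName",
            MapElements.into(TypeDescriptors.strings())
                .via((MatchResult.Metadata m) -> m.resourceId().toString()))
        // Use the <timestamp> path segment as the element's event time.
        // Files land after the time encoded in their path, so some
        // timestamp skew has to be allowed (the exact bound is a guess).
        .apply("AssignEventTime",
            WithTimestamps.<String>of(ContinuousGcsRead::timestampFromPath)
                .withAllowedTimestampSkew(Duration.standardDays(1)))
        .apply("Window",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        // Several nodes upload the same table/timestamp, so within each
        // window keep only one file per (table_name, timestamp) pair.
        .apply("Dedup",
            Distinct.withRepresentativeValueFn(
                    (String path) -> tableAndTimestamp(path))
                .withRepresentativeType(TypeDescriptors.strings()));

    p.run();
  }

  // Hypothetical parser: assumes the <timestamp> segment is epoch millis.
  static Instant timestampFromPath(String path) {
    String[] parts = path.split("/");
    return new Instant(Long.parseLong(parts[parts.length - 2]));
  }

  // Hypothetical dedup key: "<table_name>/<timestamp>" identifies a file
  // regardless of which node uploaded it.
  static String tableAndTimestamp(String path) {
    String[] parts = path.split("/");
    return parts[parts.length - 3] + "/" + parts[parts.length - 2];
  }
}
```

Note that Distinct deduplicates per window, which is why the event time is taken from the path timestamp rather than the upload time: copies of the same logical file from different nodes then fall into the same window and collapse to one element.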
