简体   繁体   中英

apache_beam, read data from GCS buckets during pipeline

I have a pub/sub topic, which gets message as soon as a file is created in the bucket, with the streaming pipeline, I am able to get the object path. Created file is AVRO. Now in my pipeline I want to read all the content of the different files, that was captured in the pub/sub messages in the pvs step.

Any direction to the solution will be appriciated, using given functions in apache_beam.

Using python sdk

Was able to do this using ReadAllFromAvro() Ref: https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.io.avroio.html?highlight=readallfromavro#apache_beam.io.avroio.ReadAllFromAvro

    with beam.Pipeline(options=self.pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> beam.io.MultipleReadFromPubSub(pubsub_subscribers, with_attributes=True)
            | "Windowing Transformation" >> GroupMessagesByFixedWindowsTransform(
            self.streamverse_options.window_size)
            | "Read From GCS" >> beam.io.ReadAllFromAvro()
            | beam.Map(print)
        )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM