
Reading multiple wildcard paths into a DataFlow PCollection

In my situation, I have a bunch of events stored as small files in Cloud Storage under date folders. My data might look like this:

2022-01-01/file1.json
2022-01-01/file2.json
2022-01-01/file3.json
2022-01-01/file4.json
2022-01-01/file5.json

2022-01-02/file6.json
2022-01-02/file7.json
2022-01-02/file8.json

2022-01-03/file9.json
2022-01-03/file10.json

The Dataflow job takes a start and end date as input and needs to read all files within that date range.
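For reference, here is a rough sketch of how the per-day wildcard paths could be built from the start and end dates (the bucket name and the build_file_patterns helper are placeholders for illustration, not part of the actual job):

from datetime import date, timedelta

def build_file_patterns(start, end, bucket='bucket_1'):
    """Return one 'gs://<bucket>/YYYY-MM-DD/*.json' pattern per day in [start, end]."""
    patterns = []
    day = start
    while day <= end:
        patterns.append(f'gs://{bucket}/{day.isoformat()}/*.json')
        day += timedelta(days=1)
    return patterns

# build_file_patterns(date(2022, 1, 1), date(2022, 1, 3)) returns:
# ['gs://bucket_1/2022-01-01/*.json',
#  'gs://bucket_1/2022-01-02/*.json',
#  'gs://bucket_1/2022-01-03/*.json']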

I am working off of this guide: https://pavankumarkattamuri.medium.com/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831

I see there is a way to load a list of files into a PCollection:

import apache_beam as beam
from apache_beam.io import ReadAllFromText, WriteToText


def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/folder_1/file.csv', 'gs://bucket_2/data.csv']

    p = beam.Pipeline(options=pipeline_options)

    (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

I also see there is a way to specify wildcards, but I haven't seen them used together.
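By wildcards I mean something like this, where ReadFromText expands the glob pattern itself (a single day's path shown just for illustration):

lines = p | "read one day" >> beam.io.ReadFromText('gs://bucket_1/2022-01-01/*.json')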

I'm wondering whether beam.Create() supports wildcards in the file list. This is my proposed solution:

def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

    p = beam.Pipeline(options=pipeline_options)

    (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

I have not tried this yet because I'm not sure it's the best approach, and I haven't found any similar examples online. Am I going in the right direction?

You can use the following approach if it's not possible to use a single URL with a wildcard:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # argument parser
    # pipeline options, google_cloud_options

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        for i, file in enumerate(file_list):
            # Labels must be unique across branches, and each branch writes to its
            # own prefix so the output shards do not collide.
            (p
                | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
                | f"transform {i}" >> beam.Map(lambda x: x)
                | f"write to GCS {i}" >> WriteToText(f'gs://bucket_3/output_{i}'))

We loop over the wildcard file patterns and, for each one, attach a read, transform, and write branch to the same pipeline.
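If you want all files to end up under a single output prefix rather than one output per pattern, the per-pattern reads can also be merged with beam.Flatten() before a single write. A minimal sketch, reusing the same placeholder paths and assuming pipeline_options is set up as before:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # pipeline options, google_cloud_options (elided as above)

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        # One read branch per wildcard pattern.
        reads = [
            p | f"Read Text {i}" >> beam.io.textio.ReadFromText(file)
            for i, file in enumerate(file_list)
        ]

        # Merge the branches, then transform and write once.
        (reads
            | "flatten" >> beam.Flatten()
            | "transform" >> beam.Map(lambda x: x)
            | "write to GCS" >> WriteToText('gs://bucket_3/output'))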
