
Reading multiple wildcard paths into a DataFlow PCollection

In my situation, I have a bunch of events stored as small files in Cloud Storage under date folders. My data might look like this:

2022-01-01/file1.json
2022-01-01/file2.json
2022-01-01/file3.json
2022-01-01/file4.json
2022-01-01/file5.json

2022-01-02/file6.json
2022-01-02/file7.json
2022-01-02/file8.json

2022-01-03/file9.json
2022-01-03/file10.json

The Dataflow job takes a start and end date as input and needs to read all files within that date range.
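For reference, here is a rough sketch of how the per-day wildcard paths could be built from the start and end dates (the bucket name and the build_file_patterns helper are placeholders for illustration, not part of the actual job):

from datetime import date, timedelta

def build_file_patterns(start, end, bucket='bucket_1'):
    """Return one 'gs://<bucket>/YYYY-MM-DD/*.json' pattern per day in [start, end]."""
    patterns = []
    day = start
    while day <= end:
        patterns.append(f'gs://{bucket}/{day.isoformat()}/*.json')
        day += timedelta(days=1)
    return patterns

# build_file_patterns(date(2022, 1, 1), date(2022, 1, 3)) returns:
# ['gs://bucket_1/2022-01-01/*.json',
#  'gs://bucket_1/2022-01-02/*.json',
#  'gs://bucket_1/2022-01-03/*.json']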

I am working off of this guide: https://pavankumarkattamuri.medium.com/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831

I see there is a way to load a list of files into a PCollection:

import apache_beam as beam
from apache_beam.io import ReadAllFromText, WriteToText


def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/folder_1/file.csv', 'gs://bucket_2/data.csv']

    p = beam.Pipeline(options=pipeline_options)

    (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

I also see there is a way to specify wildcards, but I haven't seen them used together.
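By wildcards I mean something like this, where ReadFromText expands the glob pattern itself (a single day's path shown just for illustration):

lines = p | "read one day" >> beam.io.ReadFromText('gs://bucket_1/2022-01-01/*.json')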

I'm wondering whether beam.Create() supports wildcards in the file list. This is my proposed solution:

def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

    p = beam.Pipeline(options=pipeline_options)

    (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

I have not tried this yet because I'm not sure it's the best approach, and I haven't found any similar examples online. Am I going in the right direction?

You can use the following approach if it's not possible to use a single URL with a wildcard:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # argument parser
    # pipeline options, google_cloud_options

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        for i, file in enumerate(file_list):
            # Labels must be unique across branches, and each branch writes to its
            # own prefix so the output shards do not collide.
            (p
                | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
                | f"transform {i}" >> beam.Map(lambda x: x)
                | f"write to GCS {i}" >> WriteToText(f'gs://bucket_3/output_{i}'))

We loop over the wildcard file patterns and, for each one, attach a read, transform, and write branch to the same pipeline.
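If you want all files to end up under a single output prefix rather than one output per pattern, the per-pattern reads can also be merged with beam.Flatten() before a single write. A minimal sketch, reusing the same placeholder paths and assuming pipeline_options is set up as before:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # pipeline options, google_cloud_options (elided as above)

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        # One read branch per wildcard pattern.
        reads = [
            p | f"Read Text {i}" >> beam.io.textio.ReadFromText(file)
            for i, file in enumerate(file_list)
        ]

        # Merge the branches, then transform and write once.
        (reads
            | "flatten" >> beam.Flatten()
            | "transform" >> beam.Map(lambda x: x)
            | "write to GCS" >> WriteToText('gs://bucket_3/output'))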
