Reading multiple wildcard paths into a DataFlow PCollection
In my situation, I have a bunch of events that are stored in small files in Storage under a Date folder. My data might look like this:
2022-01-01/file1.json
2022-01-01/file2.json
2022-01-01/file3.json
2022-01-01/file4.json
2022-01-01/file5.json
2022-01-02/file6.json
2022-01-02/file7.json
2022-01-02/file8.json
2022-01-03/file9.json
2022-01-03/file10.json
The DataFlow job will take a start and end date as input, and needs to read all files within that date range.
I am working off of this guide: https://pavankumarkattamuri.medium.com/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831
I see there is a way to load a list of files into a PCollection:
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, WriteToText

def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options
    file_list = ['gs://bucket_1/folder_1/file.csv', 'gs://bucket_2/data.csv']
    p = beam.Pipeline(options=pipeline_options)
    p1 = (p
          | "create PCol from list" >> beam.Create(file_list)
          | "read files" >> ReadAllFromText()
          | "transform" >> beam.Map(lambda x: x)
          | "write to GCS" >> WriteToText('gs://bucket_3/output'))
    result = p.run()
    result.wait_until_finish()
I also see there is a way to specify wildcards, but I haven't seen them used together.
I'm wondering whether beam.Create() supports wildcards in the file list. This is my proposed solution:
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, WriteToText

def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options
    file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']
    p = beam.Pipeline(options=pipeline_options)
    p1 = (p
          | "create PCol from list" >> beam.Create(file_list)
          | "read files" >> ReadAllFromText()
          | "transform" >> beam.Map(lambda x: x)
          | "write to GCS" >> WriteToText('gs://bucket_3/output'))
    result = p.run()
    result.wait_until_finish()
I haven't tried this yet, as I'm not sure it's the best approach and I don't see any similar examples online. Am I going in the right direction?
You can use the following approach if it's not possible to use a single url with a wildcard:
import apache_beam as beam
from apache_beam.io.textio import WriteToText

def run():
    # argument parser
    # pipeline options, google_cloud_options
    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']
        for i, file in enumerate(file_list):
            (p
             # transform labels and output prefixes must be unique per branch
             | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
             | f"transform {i}" >> beam.Map(lambda x: x)
             | f"write to GCS {i}" >> WriteToText(f'gs://bucket_3/output_{i}'))
We do a foreach over the wildcard file paths. For each element we attach a read-and-write branch to the pipeline; ReadFromText expands the wildcard itself, so each branch reads every file matching its pattern.
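Since the job takes a start and end date as input, the wildcard list itself can be built up front and passed to either approach. A minimal sketch (date_globs is a hypothetical helper, not part of Beam, and the bucket name is an assumption):

```python
from datetime import date, timedelta

def date_globs(bucket: str, start: date, end: date) -> list[str]:
    """Expand an inclusive date range into one wildcard path per day."""
    days = (end - start).days
    return [
        f"gs://{bucket}/{start + timedelta(days=d)}/*.json"  # date formats as YYYY-MM-DD
        for d in range(days + 1)
    ]

globs = date_globs("bucket_1", date(2022, 1, 1), date(2022, 1, 3))
# ['gs://bucket_1/2022-01-01/*.json',
#  'gs://bucket_1/2022-01-02/*.json',
#  'gs://bucket_1/2022-01-03/*.json']
```

The resulting list can then be fed to beam.Create(...) followed by ReadAllFromText(), or iterated over as in the loop above.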