
Reading multiple wildcard paths into a DataFlow PCollection

In my situation, I have a bunch of events that are stored in small files in Storage under a Date folder. My data might look like this:

2022-01-01/file1.json
2022-01-01/file2.json
2022-01-01/file3.json
2022-01-01/file4.json
2022-01-01/file5.json

2022-01-02/file6.json
2022-01-02/file7.json
2022-01-02/file8.json

2022-01-03/file9.json
2022-01-03/file10.json

The DataFlow job will take start and end date as input, and needs to read all files within that date range.
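To turn the start and end dates into concrete inputs, one option is to expand the range into one wildcard pattern per day and feed that list to the pipeline. Below is a minimal sketch of that idea; the bucket name bucket_1 and the helper name date_patterns are my own assumptions, not part of the job above:

from datetime import date, timedelta


def date_patterns(start: str, end: str, bucket: str = 'bucket_1'):
    # Return one 'gs://<bucket>/YYYY-MM-DD/*.json' pattern per day in [start, end].
    current, last = date.fromisoformat(start), date.fromisoformat(end)
    patterns = []
    while current <= last:
        patterns.append(f'gs://{bucket}/{current.isoformat()}/*.json')
        current += timedelta(days=1)
    return patterns


# date_patterns('2022-01-01', '2022-01-03') returns:
# ['gs://bucket_1/2022-01-01/*.json',
#  'gs://bucket_1/2022-01-02/*.json',
#  'gs://bucket_1/2022-01-03/*.json']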

I am working off of this guide: https://pavankumarkattamuri.medium.com/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831

I see there is a way to load a list of files into a PCollection:

import apache_beam as beam
from apache_beam.io import ReadAllFromText, WriteToText


def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/folder_1/file.csv', 'gs://bucket_2/data.csv']

    p = beam.Pipeline(options=pipeline_options)

    p1 = (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

I also see there is a way to specify wildcards, but I haven't seen them used together.

Wondering if beam.Create() supports wildcards in the file list? This is my solution:

def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options

    file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

    p = beam.Pipeline(options=pipeline_options)

    p1 = (p
        | "create PCol from list" >> beam.Create(file_list)
        | "read files" >> ReadAllFromText()
        | "transform" >> beam.Map(lambda x: x)
        | "write to GCS" >> WriteToText('gs://bucket_3/output'))

    result = p.run()
    result.wait_until_finish()

Have not tried this yet as I'm not sure if it's the best approach and don't see any examples online of anything similar. Wondering if I'm going in the right direction?

You can use the following approach if it's not possible to use a single URL with a wildcard:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # argument parser
    # pipeline options, google_cloud_options

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        for i, file in enumerate(file_list):
            (p
               | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
               | f"transform {i}" >> beam.Map(lambda x: x)
               # Give each branch its own output prefix so the shard files don't collide.
               | f"write to GCS {i}" >> WriteToText(f'gs://bucket_3/output_{i}'))

We loop over the wildcard file paths and, for each pattern, apply a read, a transform, and a write. Note that each branch needs unique transform labels and its own output prefix.
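If a single output is preferred over one write per pattern, the per-pattern reads can also be merged with beam.Flatten() before a single WriteToText. This is only a sketch of that variant, reusing the bucket paths assumed above:

import apache_beam as beam
from apache_beam.io import WriteToText


def run():
    # argument parser
    # pipeline options, google_cloud_options

    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://bucket_1/2022-01-03/*.json']

        # One read per wildcard pattern, each with a unique label.
        reads = [
            p | f"Read Text {i}" >> beam.io.textio.ReadFromText(file)
            for i, file in enumerate(file_list)
        ]

        # Merge the per-pattern PCollections, then transform and write once.
        (reads
            | "flatten" >> beam.Flatten()
            | "transform" >> beam.Map(lambda x: x)
            | "write to GCS" >> WriteToText('gs://bucket_3/output'))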
