How to read multiple files in Apache Beam from a GCP bucket

Question

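I am trying to use Apache Beam to read multiple files in GCP and apply some subsetting to them. I have prepared two pipelines that work for a single file but fail when I try them on multiple files. In addition, if possible, I would like to combine my pipelines into one, or find a way to orchestrate them so that they run in sequence. The pipelines currently run locally, but the end goal is to run them with Dataflow.

I tried textio.ReadFromText and textio.ReadAllFromText, but I could not get either of them to work with multiple files.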

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def toJson(file):
    # Load a whole local file as a single JSON document.
    with open(file) as f:
        return json.load(f)


# Pipeline 1: read one gzipped file from the bucket and write it to a local file.
with beam.Pipeline(options=PipelineOptions()) as p:
    files = (p
        | beam.io.textio.ReadFromText("gs://my_bucket/file1.txt.gz", skip_header_lines=0)
        | beam.io.WriteToText("/home/test",
              file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

# Pipeline 2: parse the local file, extract the subjects, and count each one.
with beam.Pipeline(options=PipelineOptions()) as p:
    lines = (p
        | 'read_data' >> beam.Create(['test-00000-of-00001.json'])
        | "toJson" >> beam.Map(toJson)
        | "takeItems" >> beam.FlatMap(lambda line: line["Items"])
        | "takeSubjects" >> beam.FlatMap(lambda line: line['data']['subjects'])
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
              file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))

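These two pipelines work for a single file, but I have hundreds of files in the same format and would like to take advantage of parallel computation.

Is there a way to make this pipeline work for multiple files in the same directory?

Is it possible to do this in a single pipeline instead of creating two different pipelines? (Writing the files from the bucket to the worker node is not convenient.)

Thank you very much!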

Answer 1

I worked out how to make it work for multiple files, but I could not get it to run in a single pipeline. I used a for loop and then beam.Flatten.

Here is my solution:

# Reuses the imports and toJson() from the question.
file_list = ["gs://my_bucket/file*.txt.gz"]
res_list = ["/home/subject_test_{}-00000-of-00001.json".format(i) for i in range(len(file_list))]

# Pipeline 1: copy each input (or glob pattern) to a local JSON file.
with beam.Pipeline(options=PipelineOptions()) as p:
    for i, file in enumerate(file_list):
        (p
            | "Read Text {}".format(i) >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
            | "Write Text {}".format(i) >> beam.io.WriteToText("/home/subject_test_{}".format(i),
                  file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

# Pipeline 2: build one branch per local file, then merge the branches with Flatten.
pcols = []
with beam.Pipeline(options=PipelineOptions()) as p:
    for i, res in enumerate(res_list):
        pcol = (p
            | 'read_data_{}'.format(i) >> beam.Create([res])
            | "toJson_{}".format(i) >> beam.Map(toJson)
            | "takeItems_{}".format(i) >> beam.FlatMap(lambda line: line["Items"])
            | "takeSubjects_{}".format(i) >> beam.FlatMap(lambda line: line['data']['subjects']))
        pcols.append(pcol)
    out = (pcols
        | beam.Flatten()
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
              file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))
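
For the single-pipeline part of the question, one possible approach is to skip the intermediate local files and read each matched file whole inside the pipeline. The sketch below has not been tested against the asker's data. It assumes every file is a single JSON document shaped like {"Items": [{"data": {"subjects": [...]}}]}, which is what the json.load call in the question implies. The names subjects_from_document and run are hypothetical helpers introduced here; fileio.MatchFiles and fileio.ReadMatches are Beam's file-matching transforms, and with ReadMatches left at its default compression setting the .gz files should be decompressed based on their extension.

```python
import json


def subjects_from_document(text):
    """Yield every subject found in one JSON document (a whole file's text).

    Mirrors the toJson / takeItems / takeSubjects steps from the question.
    Assumed shape: {"Items": [{"data": {"subjects": [...]}}, ...]}.
    """
    document = json.loads(text)
    for item in document["Items"]:
        yield from item["data"]["subjects"]


def run(pattern="gs://my_bucket/file*.txt.gz", output_prefix="/home/items"):
    # Beam is imported here so the helper above can be used without it.
    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            # Expand the glob into one metadata element per matching file.
            | "match_files" >> fileio.MatchFiles(pattern)
            # Turn each match into a readable file handle.
            | "open_files" >> fileio.ReadMatches()
            # Read each file in full as text; one element per file.
            | "read_whole_file" >> beam.Map(lambda f: f.read_utf8())
            | "take_subjects" >> beam.FlatMap(subjects_from_document)
            | "count_subjects" >> beam.combiners.Count.PerElement()
            | "write_counts" >> beam.io.WriteToText(
                output_prefix, file_name_suffix=".txt", num_shards=1)
        )
```

Calling run() starts the pipeline with the glob from the answer above. Because each file becomes its own element, the runner can spread the files across workers, but a single file is not split, so this fits many moderately sized files better than a few very large ones. On Dataflow the output prefix would presumably need to be a gs:// path rather than a local one.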

Answer 1
2 Accepted 2019-11-10 22:47:26
