Which apache-beam feature should I use to read the first function in a pipeline and get its output?

Question

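I am trying to create a Dataflow pipeline and deploy it in a cloud environment. I have the code below, which tries to read files from a specific folder in a GCS bucket: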

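from google.cloud import storage

def read_from_bucket():
    file_paths = []
    client = storage.Client()
    bucket = client.bucket("bucket_name")
    # List every object under the "test_folder/" prefix.
    blobs_specific = list(bucket.list_blobs(prefix="test_folder/"))
    for blob in blobs_specific:
        # Keep the object path (blob.name) rather than the Blob object itself.
        file_paths.append(blob.name)
    return file_paths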

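The function above returns a list of the file paths present in that GCS folder. This list is then passed to the code below, which filters the files by extension and stores them in the corresponding folders in a GCS bucket.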

import pathlib
from google.cloud import storage

def write_to_bucket():
    client = storage.Client()
    source_bucket = client.bucket("source_bucket")
    destination_bucket = client.bucket("destination_bucket")

    for blob in client.list_blobs("dataplus-temp"):
        source_blob = source_bucket.blob(blob.name)
        file_extension = pathlib.Path(blob.name).suffix

        # Copy each file into a folder chosen by its extension.
        # The new name is built from blob.name, not from the Blob object.
        if file_extension == ".json":
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'source_bucket/JSON/{}'.format(blob.name))
        elif file_extension == ".txt":
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'Text/{}'.format(blob.name))

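I have to do the above with a Dataflow pipeline, so that the file paths go into the pipeline and the files are stored in the corresponding folders. I created the Dataflow pipeline shown below, but I am not sure whether I used the right PTransforms.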

pipe = beam.Pipeline()
(
    pipe
    | "Read data from bucket" >> beam.ParDo(read_from_bucket)
    | "Write files to folders" >> beam.ParDo(write_to_bucket)
)

pipe.run()

It is executed as follows:

python .\filename.py --region asia-east1 --runner DataflowRunner --project project_name --temp_location gs://bucket_name/tmp --job_name test

After execution, I get the following error:

return inputs[0].windowing
AttributeError: 'PBegin' object has no attribute 'windowing'

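I have checked the apache-beam documentation but somehow could not make sense of it. I am new to apache-beam and just starting out, so please bear with me if this question is silly.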

Please help me solve this problem.

Answer 1

To solve your problem, you should use the existing Beam IO connectors to read from and write to Cloud Storage, for example:

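import logging

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

def run():
  # Runner, project, region and similar settings are parsed from the command-line flags.
  pipeline_options = PipelineOptions()

  with beam.Pipeline(options=pipeline_options) as p:

    (
      p
      # Root transform: reads every matching file line by line.
      | "Read data from bucket" >> ReadFromText("gs://your_input_bucket/your_folder/*")
      # WriteToText takes a file path prefix; "output" is a placeholder prefix.
      | "Write files to folders" >> WriteToText("gs://your_dest_bucket/output")
    )

if __name__ == "__main__":
  logging.getLogger().setLevel(logging.INFO)
  run()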
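
As background on the error in the question: a `beam.ParDo` needs an input PCollection, and applying it directly to the pipeline object (a `PBegin`) is what typically raises `'PBegin' object has no attribute 'windowing'`. A pipeline has to start with a root transform such as a read or `beam.Create`. The following is a minimal sketch, not a definitive implementation, of feeding the output of a plain Python listing function into a pipeline with `beam.Create`. The helper names `to_gcs_uri` and `list_paths` are hypothetical, and the third-party imports are kept inside the functions only so that the pure helper can be tried without Beam installed.

```python
import logging


def to_gcs_uri(bucket_name, blob_name):
    """Build a gs:// URI from a bucket name and an object name."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def list_paths(bucket_name, prefix):
    """List object URIs under a prefix.

    Note: this runs on the machine that launches the job, not on the workers.
    """
    from google.cloud import storage  # requires google-cloud-storage

    client = storage.Client()
    return [
        to_gcs_uri(bucket_name, blob.name)
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    ]


def run(bucket_name="bucket_name", prefix="test_folder/", argv=None):
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    paths = list_paths(bucket_name, prefix)
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            # beam.Create is a root transform: it turns the list into a PCollection.
            | "Create paths" >> beam.Create(paths)
            # Later steps now receive one path per element.
            | "Log paths" >> beam.Map(lambda path: logging.info("found %s", path))
        )
```

Because the listing happens while the pipeline is being built, this approach suits small, fixed sets of files; for large or changing inputs a Beam IO transform is usually the better fit.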

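Note that `ReadFromText` reads file contents line by line rather than moving files. If the goal is to copy whole files into folders by extension, as in the question, one possible sketch is to match the files with `fileio.MatchFiles` and copy each one with `FileSystems.copy`. The mapping `FOLDER_BY_EXTENSION`, the helpers `destination_for` and `copy_file`, and the `run` parameters are assumptions made for illustration, and Beam is imported inside the functions only so the path logic can be checked on its own.

```python
import posixpath

# Hypothetical mapping from file extension to destination folder.
FOLDER_BY_EXTENSION = {".json": "JSON", ".txt": "Text"}


def destination_for(source_path, dest_root):
    """Return the destination path for source_path, or None if its extension is unmapped."""
    file_name = posixpath.basename(source_path)
    extension = posixpath.splitext(file_name)[1]
    folder = FOLDER_BY_EXTENSION.get(extension)
    if folder is None:
        return None
    return posixpath.join(dest_root, folder, file_name)


def copy_file(source_path, dest_root):
    """Copy one file into its extension folder; return the new path, or None if skipped."""
    from apache_beam.io.filesystems import FileSystems

    destination = destination_for(source_path, dest_root)
    if destination is not None:
        FileSystems.copy([source_path], [destination])
    return destination


def run(input_pattern, dest_root, argv=None):
    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            # Root transform: emits one FileMetadata per matching file.
            | "Match files" >> fileio.MatchFiles(input_pattern)
            | "Get paths" >> beam.Map(lambda metadata: metadata.path)
            # dest_root is passed to copy_file as an extra argument.
            | "Copy by extension" >> beam.Map(copy_file, dest_root)
        )
```

For example, `run("gs://your_input_bucket/your_folder/*", "gs://your_dest_bucket")` would, under these assumptions, copy `a.json` to `gs://your_dest_bucket/JSON/a.json` and leave files with other extensions untouched.
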
Solution 1
0 2023-01-10 21:13:26