
Which apache-beam feature should I use to just run a function first in the pipeline and take its output?

I'm trying to create a Dataflow pipeline and deploy it in a cloud environment. I have code like the below, which tries to read files from a specific folder within a GCS bucket:

from google.cloud import storage

def read_from_bucket():
    # List every object under the given prefix and collect it.
    file_paths = []
    client = storage.Client()
    bucket = client.bucket("bucket_name")
    blobs_specific = list(bucket.list_blobs(prefix="test_folder/"))
    for blob in blobs_specific:
        file_paths.append(blob)
    return file_paths

The above function returns a list of the file paths present in that GCS folder. This list is then sent to the code below, which filters the files by their extension and stores them in the respective folders in the GCS bucket.

import pathlib

from google.cloud import storage

def write_to_bucket():
    client = storage.Client()

    for blob in client.list_blobs("dataplus-temp"):
        source_bucket = client.bucket("source_bucket")
        source_blob = source_bucket.blob(blob.name)

        file_extension = pathlib.Path(blob.name).suffix

        if file_extension == ".json":
            destination_bucket = client.bucket("destination_bucket")
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'source_bucket/JSON/{}'.format(source_blob))

        elif file_extension == ".txt":
            destination_bucket = client.bucket("destination_bucket")
            new_blob = source_bucket.copy_blob(source_blob, destination_bucket, 'Text/{}'.format(source_blob))


 

I have to execute the above implementation in a Dataflow pipeline, in such a way that the file paths go into the pipeline and the files get stored in the respective folders. I created a Dataflow pipeline like the one below, but I'm not sure whether I used the right PTransforms.

pipe = beam.Pipeline()
(
    pipe
    | "Read data from bucket" >> beam.ParDo(read_from_bucket)
    | "Write files to folders" >> beam.ParDo(write_to_bucket)
)

pipe.run()

I executed it like this:

python .\filename.py --region asia-east1 --runner DataflowRunner --project project_name --temp_location gs://bucket_name/tmp --job_name test

I'm getting the error below after execution:

return inputs[0].windowing
AttributeError: 'PBegin' object has no attribute 'windowing'

I have checked the apache-beam documentation, but somehow I'm unable to understand it. I'm new to apache-beam and have just started as a beginner, so please bear with me if this question is silly.

Kindly help me with how to solve this.

To solve your issue you have to use the existing Beam IO connectors to read from and write to Cloud Storage, for example:

import logging

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Read data from bucket" >> ReadFromText("gs://your_input_bucket/your_folder/*")
            | "Write files to folders" >> WriteToText("gs://your_dest_bucket")
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()
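
Note that ReadFromText/WriteToText read and write the contents of text files line by line; they do not copy whole objects between folders. If what you actually need is to keep your own google-cloud-storage logic (list the files, then copy each one into a folder picked by its extension), a minimal sketch along those lines is below. It seeds the pipeline with beam.Create so the subsequent transforms have a PCollection to consume, which is what the 'PBegin' error was complaining about; the bucket, prefix and folder names are placeholders taken from your snippets and should be replaced with your own.

import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def list_blob_names(_):
    # Runs once (for the single seed element) and returns the object names
    # found under the prefix; beam.FlatMap then emits one element per file.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("bucket_name")
    return [blob.name for blob in bucket.list_blobs(prefix="test_folder/")]


def copy_by_extension(blob_name):
    # Copies a single object into a destination folder chosen by its extension.
    import pathlib
    from google.cloud import storage

    client = storage.Client()
    source_bucket = client.bucket("source_bucket")
    destination_bucket = client.bucket("destination_bucket")
    source_blob = source_bucket.blob(blob_name)

    extension = pathlib.Path(blob_name).suffix
    if extension == ".json":
        source_bucket.copy_blob(source_blob, destination_bucket, "JSON/{}".format(blob_name))
    elif extension == ".txt":
        source_bucket.copy_blob(source_blob, destination_bucket, "Text/{}".format(blob_name))


def run():
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Start" >> beam.Create([None])                 # gives the pipeline an initial PCollection
            | "List files" >> beam.FlatMap(list_blob_names)  # one element per object name
            | "Copy by extension" >> beam.Map(copy_by_extension)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()

Listing everything inside a single FlatMap keeps the example simple but does not parallelise the listing itself; for large buckets you may want to look at the fileio transforms (e.g. apache_beam.io.fileio.MatchFiles) instead.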
