Efficient way to read a CSV in Apache Beam (Python)

Question

After reading several questions on Stack Overflow, I have been using the code below to read CSV files in Beam.

Pipeline code:

with beam.Pipeline(options=pipeline_options) as p:
    parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file]))
    flattened_file = (parsed_csv | 'Flatten the CSV' >> beam.FlatMap(get_csv_reader))

Function to read the CSV: get_csv_reader()

import csv
import io

import apache_beam as beam

def get_csv_reader(readable_file):
    # Open a channel to read the file from GCS (this is a binary stream)
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Wrap the binary stream as text and read it as a CSV
    gcs_reader = csv.reader(io.TextIOWrapper(gcs_file))

    # Skip the header row
    next(gcs_reader)

    return gcs_reader

I am using this instead of ReadFromText because ReadFromText fails when field values contain newline characters.

Question: Is this way of reading a CSV efficient? Would it fail on huge files? I ask because my function uses csv.reader, and I suspect this loads the whole file into memory, which would cause a failure for huge files. Please correct my understanding if I am wrong.

Additionally, since this runs inside a PTransform, will my function be serialized to run on different worker nodes? I am confused about how Beam runs this code behind the scenes.

If this is not efficient, please suggest an efficient way to read a CSV in Apache Beam.

Answer 1

You can define a generator to lazily read the files row by row.

import csv
import io
import apache_beam as beam

def read_csv_file(readable_file):
  with beam.io.filesystems.FileSystems.open(readable_file) as gcs_file:
    for row in csv.reader(io.TextIOWrapper(gcs_file)):
      yield row
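
To see the lazy behaviour without a Beam runner, here is a minimal standalone sketch of the same generator pattern. It assumes the handle returned by FileSystems.open behaves like any binary file object, so an io.BytesIO stands in for it here. The names iter_csv_rows, skip_header and encoding are hypothetical and only for illustration, and UTF-8 is an assumption about the file. csv.reader pulls lines from the stream only as rows are requested, so the file is not loaded into memory as a whole, and a quoted field that contains a newline still comes back as a single row.

```python
import csv
import io


def iter_csv_rows(binary_stream, skip_header=True, encoding="utf-8"):
    """Yield one parsed CSV row at a time from a binary file-like object."""
    # csv.reader needs text, so decode the binary stream on the fly.
    # newline="" leaves line endings untouched so that the csv module
    # can handle newlines inside quoted fields itself.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    reader = csv.reader(text_stream)
    if skip_header:
        # Drop the header row; the default avoids an error on an empty file.
        next(reader, None)
    for row in reader:
        yield row


# Stand-in for the handle returned by FileSystems.open(...).
# The comment field of the first record spans two physical lines.
sample = io.BytesIO(b'id,comment\n1,"first line\nsecond line"\n2,plain\n')
rows = list(iter_csv_rows(sample))
print(rows)
```

In the pipeline, a generator like this is used the same way as in the question, through beam.FlatMap. Note that with this pattern each file is most likely read sequentially by a single worker, so memory use stays low but one very large file is not split across workers.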

A similar question is How to handle newlines when loading a CSV into Apache Beam?
