Efficient way to read a CSV in Apache Beam (Python)

After reading some questions on StackOverflow, I have been using the code below to read CSV files in Beam.

Pipeline code:

with beam.Pipeline(options=pipeline_options) as p:
    parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file]))
    flattened_file = (parsed_csv | 'Flatten the CSV' >> beam.FlatMap(get_csv_reader))

Method to read the CSV, get_csv_reader():

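import csv
import io

import apache_beam as beam

def get_csv_reader(readable_file):
    # Open a channel to read the file from GCS (this is a binary stream)
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Wrap the binary stream as text and read it as a CSV
    gcs_reader = csv.reader(io.TextIOWrapper(gcs_file))

    # Skip the header row
    next(gcs_reader)

    return gcs_reader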

I am using this instead of ReadFromText because ReadFromText fails when field values contain newline characters.

Question: Is this way of reading a CSV efficient? Would it fail on huge files? I ask because I am using csv.reader in my method, and I suspect it loads the whole file into memory, which would cause a failure for huge files. Please correct my understanding if I am wrong.

Additionally, since this runs inside a PTransform, will my method be serialized to run on different worker nodes? I am confused about how Beam runs this code behind the scenes.

If this is not efficient, please suggest an efficient way to read a CSV in Apache Beam.

You can define a generator that reads the file lazily, row by row:

import csv
import io

def read_csv_file(readable_file):
  with beam.io.filesystems.FileSystems.open(readable_file) as gcs_file:
    # FileSystems.open yields bytes; csv.reader needs text
    for row in csv.reader(io.TextIOWrapper(gcs_file)):
      yield row
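
On the memory concern from the question: csv.reader is an iterator, so it parses rows as they are requested rather than loading the whole file. The following is a minimal sketch of that behaviour using only the standard library. io.BytesIO is a stand-in for the stream returned by FileSystems.open (assuming that stream behaves like any readable binary stream), and iter_csv_rows is a hypothetical helper name, not part of Beam.

```python
import csv
import io


def iter_csv_rows(binary_stream, encoding="utf-8"):
    """Yield parsed CSV rows one at a time from a binary file-like object."""
    # newline="" lets the csv module deal with line endings itself,
    # including newlines inside quoted fields.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    for row in csv.reader(text_stream):
        yield row


# Stand-in for a file opened via FileSystems.open (an assumption for this sketch).
fake_file = io.BytesIO(b"id,name\n1,alice\n2,bob\n")

rows = iter_csv_rows(fake_file)  # nothing has been read or parsed yet
header = next(rows)              # the first row is parsed only now
first = next(rows)               # later rows are parsed on demand
remaining = list(rows)           # drain whatever is left
```

The underlying stream is still read in buffered chunks, so memory use depends on the buffer and the current row, not on the file size. Note also that with beam.Create([input_file]) the file path is a single element, so the file is read by one worker rather than being split across workers.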

A similar question: "How to handle newlines when loading a CSV into Apache Beam?"
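
The newline problem can be reproduced without Beam. The sketch below uses made-up sample data and compares plain line splitting, which is roughly what a line-oriented reader does, with csv.reader, which keeps a quoted newline inside its field.

```python
import csv
import io

# One header line plus two records; the second record's "note" field
# contains a newline inside quotes. The data is invented for illustration.
data = 'id,note\n1,"plain"\n2,"first line\nsecond line"\n'

# Naive line splitting breaks the second record into two pieces.
lines = data.splitlines()

# CSV-aware parsing treats the quoted newline as part of the field.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(lines))    # number of physical lines
print(len(records))  # number of logical CSV rows, header included
```

Here the input has four physical lines but only three CSV rows, which is why a reader that emits one element per line cannot reconstruct the original records.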
