Using PANDAS with Apache Beam

Question

I am new to Apache Beam and have just started working with its Python SDK. I understand Pipelines, PCollections, PTransforms, ParDo and DoFn at a high level.

In my current project, the pipeline is implemented using pandas to read, transform and write files, with the syntax shown below.

I want to understand whether this is a correct use of Apache Beam, since we are reading and writing the files directly with pandas and not processing them element by element.

Steps:

1. Create the Pipeline.
2. Create a PCollection containing the input file path.
3. Call a DoFn and pass it the file path.
4. Do everything inside the DoFn (read, transform and write) using pandas.

Sample high-level code:

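import apache_beam as beam
import pandas
from apache_beam.options.pipeline_options import PipelineOptions

class ActionClass(beam.DoFn):

    def process(self, file_path):
        # read the file into a DataFrame using pandas
        df = pandas.read_csv(file_path)
        # do some transformation using pandas
        # write the DataFrame to the output file from inside the DoFn
        return

def run():

    # default options; set runner, project, etc. as needed
    options = PipelineOptions()
    p = beam.Pipeline(options=options)

    # read only the file path(s); each line becomes one element
    file_paths = p | beam.io.ReadFromText('input_file_path')

    output = file_paths | 'PTransform' >> beam.ParDo(ActionClass())

    p.run()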

Answer 1

In my opinion, if you have a large number of small CSV files that you want to process with pandas, then this is probably a valid use case for Apache Beam.
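
A minimal sketch of that per-file pattern, assuming the list of CSV paths is known up front. The names `summarize_file` and `run` are hypothetical, and counting rows only stands in for whatever pandas transformation is actually needed. Each path is one element, so the runner can hand different files to different workers, while each individual file is still read in one piece by pandas.

```python
import pandas as pd


def summarize_file(path):
    """Read one small CSV with pandas and return (path, number of data rows)."""
    df = pd.read_csv(path)
    return path, len(df)


def run(file_paths):
    # Beam is imported here so summarize_file can be tried on its own.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Paths" >> beam.Create(file_paths)     # one element per file path
            | "Spread" >> beam.Reshuffle()           # lets the runner redistribute the paths
            | "PerFile" >> beam.Map(summarize_file)  # pandas runs once per file
            | "Print" >> beam.Map(print)
        )
```

Calling something like `run(["a.csv", "b.csv"])` would print one `(path, row_count)` tuple per file. Reading remote paths such as `gs://` with pandas needs extra filesystem packages, which this sketch does not cover.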

Thanks

Answer 2

In my opinion, you are not using the power of Beam, because with your solution you do not get the parallel processing that Beam is really useful for.

I suggest reading the CSV with ReadFromText and using Map or ParDo to transform the data. In that case Beam reads the CSV itself and can distribute the data across different workers, which then do the transformation.
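
A rough sketch of that row-by-row approach, under some assumptions: the input is a hypothetical CSV with a header and three columns (name, quantity, unit price), and `parse_line`, `to_output_line` and `run` are illustrative names, not anything from the question. ReadFromText splits on newlines, so this would not handle quoted fields that span several lines.

```python
import csv


def parse_line(line):
    """Split one CSV line into a list of fields (handles quoted commas)."""
    return next(csv.reader([line]))


def to_output_line(fields):
    """Hypothetical transformation: compute quantity * unit price per row."""
    name, quantity, unit_price = fields
    total = int(quantity) * float(unit_price)
    return f"{name},{total:.2f}"


def run(input_path, output_prefix):
    # Beam is imported here so the helpers above can be tried on their own.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            # one element per line, skipping the header row
            | "Read" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse" >> beam.Map(parse_line)
            | "Transform" >> beam.Map(to_output_line)
            | "Write" >> beam.io.WriteToText(output_prefix, file_name_suffix=".csv")
        )
```

Because every line is its own element, the runner is free to split the file and run the Map steps on several workers, which is the part the pandas-inside-one-DoFn version gives up.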

Depending on what you are trying to do, you can also use DataFrames directly in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/misc/sample.csv")
  agg = df[['passenger_count', 'DOLocationID']].groupby('DOLocationID').sum()
  agg.to_csv('output')
