
Using PANDAS with Apache Beam

I am new to Apache Beam and have just started working with its Python SDK. I know the high-level concepts: Pipelines, PCollections, PTransforms, ParDo and DoFn.

In my current project, the pipeline has been implemented using pandas to read, transform and write files, following the approach shown below.

I want to understand whether this is a correct use of Apache Beam, since we are reading and writing the files directly with pandas and not processing them element by element.

steps:

  1. Create the Pipeline.
  2. Create a PCollection of input file paths.
  3. Call a DoFn and pass it each file path.
  4. Do everything inside the DoFn (read, transform and write) using pandas.

sample high level code:

import apache_beam as beam
import pandas
from apache_beam.options.pipeline_options import PipelineOptions

class ActionClass(beam.DoFn):

    def process(self, file_path):
        # read the file into a DataFrame using pandas
        df = pandas.read_csv(file_path)
        # do some transformation using pandas
        # write the DataFrame to an output file from inside the DoFn
        return

def run():

    options = PipelineOptions()
    p = beam.Pipeline(options=options)

    # reads only the file paths, one per line
    input = p | beam.io.ReadFromText('input_file_path')

    output = input | 'PTransform' >> beam.ParDo(ActionClass())

    p.run().wait_until_finish()

In my opinion, if you have a large number of small CSV files that you want to process with pandas, then this is probably a valid use case for Apache Beam, because each file can be handled by a different worker.
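
To illustrate the many-small-files case, here is a minimal sketch, not a definitive implementation. The function names (`output_path_for`, `process_file`, `run`), the `dropna()` transformation and the output layout are all hypothetical. It assumes the files are on a local filesystem; reading `gs://` paths with pandas would need extra filesystem support installed on the workers.

```python
import os


def output_path_for(input_path, output_dir):
    """Map 'some/dir/a.csv' to '<output_dir>/a.csv' (hypothetical layout)."""
    return os.path.join(output_dir, os.path.basename(input_path))


def process_file(input_path, output_dir):
    """Read, transform and write one whole file with pandas."""
    # Imported here so the helper above stays usable without pandas installed.
    import pandas as pd

    df = pd.read_csv(input_path)
    df = df.dropna()  # placeholder for the real transformation
    out_path = output_path_for(input_path, output_dir)
    df.to_csv(out_path, index=False)
    return out_path


def run(file_pattern, output_dir):
    # Imported here so this module can be loaded without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as p:
        (
            p
            # One element per file that matches the glob pattern.
            | "MatchFiles" >> fileio.MatchFiles(file_pattern)
            | "ToPath" >> beam.Map(lambda metadata: metadata.path)
            # Redistribute the paths so files can be spread over workers.
            | "Reshuffle" >> beam.Reshuffle()
            | "ProcessFile" >> beam.Map(process_file, output_dir)
        )
```

Here the unit of parallelism is the file, not the row, so this only helps when there are many files and each one fits comfortably in a worker's memory.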

Thanks

In my opinion, you are not using the power of Beam.

With your solution you do not get the parallel processing that makes Beam useful: each file is read, transformed and written inside a single DoFn call, so the work on one file cannot be split across workers.

I suggest reading the CSV with ReadFromText and using Map or ParDo to transform the data. In that case Beam reads the CSV and can distribute the rows across different workers, which apply the transformation in parallel.
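
A rough sketch of that element-by-element approach could look like the following. The column names (`item`, `price`, `quantity`), the `total` calculation and the function names are made up for illustration. It also assumes every record sits on a single line, because ReadFromText splits the input on newlines and a quoted field containing a line break would be cut apart.

```python
import csv


def parse_line(line, columns):
    """Turn one CSV line into a dict keyed by column name."""
    values = next(csv.reader([line]))
    return dict(zip(columns, values))


def add_total(row):
    """Example per-row transformation: total = price * quantity."""
    out = dict(row)
    out["total"] = float(row["price"]) * int(row["quantity"])
    return out


def format_row(row, columns):
    """Serialize a row dict back into one CSV line (no quoting)."""
    return ",".join(str(row[c]) for c in columns)


def run(input_path, output_prefix):
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam

    columns = ["item", "price", "quantity"]  # hypothetical schema
    with beam.Pipeline() as p:
        (
            p
            | "ReadLines" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "ParseRow" >> beam.Map(parse_line, columns)
            | "AddTotal" >> beam.Map(add_total)
            | "FormatRow" >> beam.Map(format_row, columns + ["total"])
            | "WriteLines" >> beam.io.WriteToText(output_prefix, file_name_suffix=".csv")
        )
```

Each step receives one row at a time, so the runner is free to process different rows on different workers, which is not possible when a whole file is handled inside one pandas call.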

Depending on what you are trying to do, you can also use the DataFrame API directly in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
  df = p | read_csv("gs://apache-beam-samples/nyc_taxi/misc/sample.csv")
  agg = df[['passenger_count', 'DOLocationID']].groupby('DOLocationID').sum()
  agg.to_csv('output')
