
Python/Apache-Beam: How to Parse Text File To CSV?

I'm still new to Beam. How exactly do you read CSV files that are in GCS buckets? I essentially want to transform these files into a pandas DataFrame using Beam and then apply an sklearn model to "train" on this data. Most of the examples I've seen predefine the header, but I want this Beam pipeline to generalize to any file, where the headers will definitely be different. There's a library called beam_utils that does what I want, but then I run into this error: AttributeError: module 'apache_beam.io.fileio' has no attribute 'CompressionTypes'

Code Example:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The error occurs in this import
from beam_utils.sources import CsvFileSource

options = {
    'project': 'my-project',
    'runner': 'DirectRunner',
    'streaming': False
}

pipeline_options = PipelineOptions(flags=[], **options)

class Printer(beam.DoFn):
    def process(self, element):
        print(element)

with beam.Pipeline(options=pipeline_options) as p:  # Create the Pipeline with the specified options.

    data = (p
            | 'Read File From GCS' >> beam.io.textio.ReadFromText('gs://my-csv-files')
            )

    _ = (data | "Print the data" >> beam.ParDo(Printer()))

# Leaving the `with` block runs the pipeline and waits for it to finish,
# so no explicit p.run() call is needed here.

The Apache Beam module fileio was recently modified with backward-incompatible changes, and the beam_utils library hasn't been updated yet.

I went through the question suggested by @Pablo and the source code of beam_utils (also written by Pablo) to replicate the behavior using the filesystems module.

Below are two versions of the code using pandas to generate the DataFrame(s).

csv used for the example:

a,b
1,2
3,4
5,6

Reading the csv and creating the DataFrame with all its content

import apache_beam as beam
import pandas as pd
import csv
import io

def create_dataframe(readable_file):

    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Read it as csv, you can also use csv.reader
    csv_dict = csv.DictReader(io.TextIOWrapper(gcs_file))

    # Create the DataFrame
    dataFrame = pd.DataFrame(csv_dict)
    print(dataFrame.to_string())

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(create_dataframe)
)

p.run()

Resulting DataFrame

   a  b
0  1  2
1  3  4
2  5  6
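
The header handling above can be checked locally without GCS. The following is a minimal sketch, assuming an io.BytesIO object stands in for the binary handle that FileSystems.open returns; read_rows is a hypothetical helper name, not part of Beam or beam_utils. It shows that csv.DictReader takes the column names from the first row, so no header has to be predefined.

```python
import csv
import io


def read_rows(binary_handle, encoding="utf-8"):
    """Decode a binary file handle and return (header, rows).

    The header is taken from the first line of the file, so the same
    function works for files with different columns.
    """
    # The csv module needs text, so wrap the binary handle in a decoder.
    text = io.TextIOWrapper(binary_handle, encoding=encoding, newline="")
    reader = csv.DictReader(text)
    rows = list(reader)
    return reader.fieldnames, rows


# Assumption: io.BytesIO plays the role of the GCS file handle for local testing.
sample = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n")
header, rows = read_rows(sample)
print(header)
print(rows[0])
```

Note that every value comes back as a string, which is also true of the DataFrame built from the reader.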

Reading the csv and creating the DataFrames in a separate transform

def get_csv_reader(readable_file):

    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Return the csv reader
    return csv.DictReader(io.TextIOWrapper(gcs_file))

p = beam.Pipeline()
(p | beam.Create(['gs://my-bucket/my-file.csv'])
   | beam.FlatMap(get_csv_reader)
   | beam.Map(lambda x: pd.DataFrame([x])) # Create the DataFrame from each csv row
   | beam.Map(lambda x: print(x.to_string()))
)

p.run()

Resulting DataFrames

   a  b
0  1  2
   a  b
0  3  4
   a  b
0  5  6
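
Since the goal in the question is to train an sklearn model, it may help to know that the rows yielded by csv.DictReader hold strings, so the DataFrames above contain text rather than numbers. Below is a minimal sketch of one way to convert them, assuming numeric-looking values should become floats and everything else should be left alone; to_numeric_row is a hypothetical helper name.

```python
def to_numeric_row(row):
    """Return a copy of a csv.DictReader row with numeric-looking values cast to float."""
    converted = {}
    for key, value in row.items():
        try:
            converted[key] = float(value)
        except (TypeError, ValueError):
            # Not a number (or a missing value): keep the original value.
            converted[key] = value
    return converted


print(to_numeric_row({"a": "1", "b": "x"}))
```

In the pipeline above, this could be applied as a beam.Map(to_numeric_row) step between the FlatMap that reads the file and the Map that builds the DataFrames.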

