
Problem loading csv into DataFrame in PySpark

I'm trying to aggregate a bunch of CSV files into one and output it to S3 in ORC format using an ETL job in AWS Glue. My aggregated CSV looks like this:

header1,header2,header3
foo1,foo2,foo3
bar1,bar2,bar3

I have a string representation of the aggregated CSV called aggregated_csv, whose content is header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3. I've read that PySpark has a straightforward way to convert CSV files into DataFrames (which I need so that I can leverage Glue's ability to easily output in ORC). Here is a snippet of what I've tried:

def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        df = glueContext.read.csv(agg_file, schema=schema, header="true")
        df.show()

I've tried it both with and without seek. When I don't call seek(), the job completes successfully but df.show() doesn't display any data other than the headers. When I do call seek(), I get the following exception:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-48-255.us-west-2.compute.internal:8020/user/root/header1,header2,header3\n;'

Since seek seems to change the behavior, and since the headers in my CSV are part of the exception string, I'm assuming the problem is somehow related to where the file cursor is when I pass the file to glueContext.read.csv(), but I'm not sure how to resolve it. If I uncomment the seek(0) call and add an agg_file.read() call, I can see the entire contents of the file as expected. What do I need to change so that I can successfully read a CSV file I've just written into a Spark DataFrame?

I think you're passing the wrong arguments to the csv function. GlueContext.read.csv() resolves to DataFrameReader.csv(), whose signature takes a file path as its first argument, while what you're passing is a file-like object.

def f(glueContext, aggregated_csv, schema):
    # Write the CSV text to a local file; the `with` block closes it on exit
    with open('somefile', 'w') as agg_file:
        agg_file.write(aggregated_csv)
    # Pass the file *path*, not the file object, to the reader
    df = glueContext.read.csv('somefile', schema=schema, header="true")
    df.show()
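
As a side note, you can skip the local file entirely: since Spark 2.2, DataFrameReader.csv() also accepts an RDD of strings holding the CSV rows. A minimal sketch, assuming spark is the underlying SparkSession:

lines = spark.sparkContext.parallelize(aggregated_csv.split('\n'))
df = spark.read.csv(lines, schema=schema, header=True)
df.show()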

BUT, if all you want is to write an ORC file, and you already have the data in aggregated_csv, you can create a DataFrame directly from a list of tuples.

df = spark.createDataFrame([('foo1','foo2','foo3'), ('bar1','bar2','bar3')], ['header1', 'header2', 'header3'])
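
If you'd rather not hard-code the rows, here is a sketch that derives both the column names and the rows from the aggregated_csv string itself (again assuming spark is your SparkSession):

lines = aggregated_csv.strip().split('\n')
columns = lines[0].split(',')                          # first line holds the headers
rows = [tuple(line.split(',')) for line in lines[1:]]  # remaining lines are data
df = spark.createDataFrame(rows, columns)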

Then, if you need a Glue DynamicFrame, use the DynamicFrame.fromDF() function:

from awsglue.dynamicframe import DynamicFrame

dynF = DynamicFrame.fromDF(df, glueContext, 'myFrame')
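
For completeness, here is a sketch of writing that DynamicFrame out as ORC through Glue (the S3 path is hypothetical):

glueContext.write_dynamic_frame.from_options(
    frame=dynF,
    connection_type='s3',
    connection_options={'path': 's3://my-bucket/output'},  # hypothetical output path
    format='orc')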

ONE MORE BUT: you don't need Glue to write ORC - Spark is totally capable of it. Just use the DataFrameWriter.orc() function:

df.write.orc('s3://path')
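
If the target prefix may already exist, you can also chain a save mode, e.g. df.write.mode('overwrite').orc('s3://path').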
