Pyspark: Merge all zipped csvs into one csv in python

Question

I have a large amount of data stored as zipped CSV files. How can I combine them into a single CSV file? (It doesn't matter whether the output is zipped.)

I am reading the files into Spark DataFrames, but I am stuck on how to concatenate PySpark DataFrames.

Below is my code, which loops over the days and tries to append each day's DataFrame to the result:

        schema=StructType([])
        result = spark.createDataFrame(sc.emptyRDD(), schema)
        for day in range(1,31):
            day_str = str(day) if day>=10 else "0"+str(day)
            print 'Ingesting %s' % day_str
            df = spark.read.format("csv").option("header", "false").option("delimiter", "|").option("inferSchema", "true").load("s3a://key/201811%s" % (day_str))
            result = result.unionAll(df)

        result.write.save("s3a://key/my_result.csv", format='csv')

This gives me the error AnalysisException: u"Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 1 columns;;\\n'Union\\n:- LogicalRDD\\n+- Relation[_c0#75] csv\\n". Could anyone help me with how to proceed?

Answer 1

This worked for me:

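    # schema_mw is a StructType describing the CSV columns, defined beforehand.
    # Seed with an empty DataFrame that has the same schema as the daily files,
    # so that union sees the same number of columns on both sides.
    result = spark.createDataFrame(sc.emptyRDD(), schema_mw)

    for day in range(1, 31):
        day_str = "%02d" % day  # zero-pad the day: 01, 02, ..., 30
        print('Ingesting %s' % day_str)

        # Pass the explicit schema instead of inferring it for each file.
        df = spark.read.format("csv").option("header", "false").option("delimiter", ",").schema(schema_mw).load("s3a://bucket/201811%s" % day_str)

        # result is always a DataFrame here, so it can be unioned directly.
        result = result.union(df)

    # repartition(1) writes the combined data as a single part file.
    result.repartition(1).write.save("s3a://bucket/key-Compiled", format='csv', header=False)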

This works. However, when I set header to true in the last step (the repartitioned write), the header is stored as a row. I am not sure how to add those headers as a header rather than as a row.
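As a possible simplification of the loop above, here is a minimal sketch that builds the list of daily paths first and hands the whole list to the reader in one call, which avoids the empty seed DataFrame and the per-day union. The helper names `day_paths` and `merge_csvs` are hypothetical, and the bucket prefix and `schema_mw` are assumed to be the same as in the answer. I have not run this against S3, so treat it as a starting point rather than a verified solution.

```python
def day_paths(prefix, year_month, n_days):
    """Build one input path per day, zero-padding the day to two digits."""
    return ["%s%s%02d" % (prefix, year_month, day) for day in range(1, n_days + 1)]


def merge_csvs(spark, paths, schema, out_path):
    """Read every path in one call and write the result as one CSV part file."""
    # DataFrameReader.csv accepts a list of paths, so all days are loaded
    # into a single DataFrame that shares the explicit schema.
    df = spark.read.csv(paths, schema=schema, sep=",", header=False)
    # coalesce(1) leaves one part file inside the out_path directory;
    # header=True writes the schema's column names as that file's first line.
    df.coalesce(1).write.csv(out_path, header=True)
    return df


# Example call (assumes an existing SparkSession `spark` and schema `schema_mw`):
# merge_csvs(spark, day_paths("s3a://bucket/", "201811", 30),
#            schema_mw, "s3a://bucket/key-Compiled")
```

Note that Spark writes `out_path` as a directory containing a part file, not as a single file with that exact name. With `header=True` on the write, the column names come from the schema, so they should appear once at the top of the single part file.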

1 answer

solution1
0 ACCEPTED 2019-02-22 23:25:56