I have a huge amount of data in the form of zipped CSVs. How can I combine it into a single CSV file (whether the output is zipped or not doesn't matter)?
I am reading the files into Spark DataFrames, but I am stuck on how to concatenate the PySpark DataFrames.
Below is my code, which runs a loop and appends a DataFrame on each iteration:
from pyspark.sql.types import StructType

schema = StructType([])
result = spark.createDataFrame(sc.emptyRDD(), schema)
for day in range(1, 31):
    day_str = str(day) if day >= 10 else "0" + str(day)
    print 'Ingesting %s' % day_str
    df = spark.read.format("csv").option("header", "false").option("delimiter", "|").option("inferSchema", "true").load("s3a://key/201811%s" % (day_str))
    result = result.unionAll(df)
result.write.save("s3a://key/my_result.csv", format='csv')
This gives me the error:
AnalysisException: u"Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 1 columns;;\n'Union\n:- LogicalRDD\n+- Relation[_c0#75] csv\n"
Could anyone help me with how to proceed?
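For what it's worth, the union loop can be avoided entirely: `spark.read.csv` accepts a list of paths, so all 30 daily files can be ingested in a single call with one consistent schema, which also sidesteps the empty seed DataFrame that triggers the error. A minimal sketch, assuming an active `SparkSession` named `spark` (`load_month` is a hypothetical helper name; the prefix is the one from the question):

```python
def load_month(spark, prefix="s3a://key/201811"):
    # Build the 30 daily paths up front; %02d zero-pads single-digit
    # days, replacing the manual day_str ternary.
    paths = ["%s%02d" % (prefix, day) for day in range(1, 31)]
    # spark.read.csv takes a list of paths, so every file is read in one
    # pass with a single schema -- no empty seed DataFrame, no union loop.
    return spark.read.csv(paths, sep="|", header=False, inferSchema=True)
```

The result can then be written out once with a single `df.write.csv(...)` call.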
This worked for me:
result = spark.createDataFrame(sc.emptyRDD(), schema_mw)  # schema_mw: the predefined schema for these files
for day in range(1, 31):
    day_str = str(day) if day >= 10 else "0" + str(day)
    print 'Ingesting %s' % day_str
    df = spark.read.format("csv").option("header", "false").option("delimiter", ",").schema(schema_mw).load("s3a://bucket/201811%s" % (day_str))
    if result:
        result = result.union(df)
    else:
        result = df
result.repartition(1).write.save("s3a://bucket/key-Compiled", format='csv', header=False)
This works. However, when I set header to true in the final write step for repartitioning, the header is stored as a data row. I am not sure how to add the column names as a proper header rather than as a row.
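On the header question: passing `header=True` to the CSV writer makes Spark emit the DataFrame's column names as the first line of each output file; that line only appears as a data row if the file is later read back with `header=False`. A hedged sketch (`write_compiled` is a hypothetical helper name; the output path is the one used above):

```python
def write_compiled(df, path="s3a://bucket/key-Compiled"):
    # header=True writes the column names (here, the ones defined in
    # schema_mw) as the first line of the single output part-file.
    df.repartition(1).write.csv(path, header=True)
    # Anything reading the result back must also pass header=True so the
    # first line is parsed as column names rather than as a row of data.
```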