
Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering whether, instead of manipulating it with Spark, writing it back out to S3, and then copying it to Redshift, I can skip a step and run a query that pulls/transforms the data and copies it straight to Redshift.

Sure thing, totally possible.

Scala code to read Parquet (adapted from the Spark SQL documentation):

// Requires: import sqlContext.implicits._ (for .toDF on an RDD of case classes)
val people: RDD[Person] = ...
people.toDF().write.parquet("people.parquet")               // an RDD has no .write, so convert to a DataFrame first
val parquetFile = sqlContext.read.parquet("people.parquet") // read it back as a DataFrame

Scala code to write to Redshift (adapted from the spark-redshift documentation):

parquetFile.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")            // target table in Redshift
  .option("tempdir", "s3n://path/for/temp/data") // S3 staging area used for the Redshift COPY
  .mode("error")                                 // fail if the table already exists
  .save()
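
To tie this back to the transformation step in the question: spark-redshift will save any DataFrame, so you can read the Parquet files from S3, transform them with Spark SQL, and write the result straight to Redshift in a single job. Below is a minimal sketch in the same Spark 1.x style as the snippets above; the bucket paths, target table name, and column names (host, event_time) are placeholders, and S3/Redshift credential configuration is omitted.

// Read the raw server data (Parquet) from S3 into a DataFrame.
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Do the transformation with Spark SQL instead of a separate write-back job.
raw.registerTempTable("server_data")
val transformed = sqlContext.sql(
  """SELECT host, to_date(event_time) AS event_date, count(*) AS events
    |FROM server_data
    |GROUP BY host, to_date(event_time)""".stripMargin)

// Hand the transformed DataFrame straight to the Redshift writer.
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_events_daily")
  .option("tempdir", "s3n://my-bucket/redshift-temp/")
  .mode("error")
  .save()

Note that spark-redshift still stages the data under tempdir and loads it with a Redshift COPY behind the scenes, so an S3 round trip does happen under the hood; you just don't have to manage that intermediate write yourself.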
