
Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering whether, instead of manipulating it with Spark, writing it back out to S3, and then copying it to Redshift, I can skip a step and run a query that pulls/transforms the data and copies it straight to Redshift.

Sure thing, totally possible.

Scala code to read Parquet (adapted from the Spark SQL documentation):

// Requires: import sqlContext.implicits._ (for .toDF on an RDD of case classes)
val people: RDD[Person] = ...
people.toDF().write.parquet("people.parquet")               // an RDD has no .write, so convert to a DataFrame first
val parquetFile = sqlContext.read.parquet("people.parquet") // read it back as a DataFrame

Scala code to write to Redshift (adapted from the spark-redshift documentation):

parquetFile.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")            // target table in Redshift
  .option("tempdir", "s3n://path/for/temp/data") // S3 staging area used for the Redshift COPY
  .mode("error")                                 // fail if the table already exists
  .save()
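
To tie this back to the transformation step in the question: spark-redshift will save any DataFrame, so you can read the Parquet files from S3, transform them with Spark SQL, and write the result straight to Redshift in a single job. Below is a minimal sketch in the same Spark 1.x style as the snippets above; the bucket paths, target table name, and column names (host, event_time) are placeholders, and S3/Redshift credential configuration is omitted.

// Read the raw server data (Parquet) from S3 into a DataFrame.
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Do the transformation with Spark SQL instead of a separate write-back job.
raw.registerTempTable("server_data")
val transformed = sqlContext.sql(
  """SELECT host, to_date(event_time) AS event_date, count(*) AS events
    |FROM server_data
    |GROUP BY host, to_date(event_time)""".stripMargin)

// Hand the transformed DataFrame straight to the Redshift writer.
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_events_daily")
  .option("tempdir", "s3n://my-bucket/redshift-temp/")
  .mode("error")
  .save()

Note that spark-redshift still stages the data under tempdir and loads it with a Redshift COPY behind the scenes, so an S3 round trip does happen under the hood; you just don't have to manage that intermediate write yourself.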
