
Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering whether, instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift, I can skip a step and run a query that pulls/transforms the data and then copies it straight to Redshift?

Sure thing, totally possible.

Scala code to read Parquet (taken from here):

import sqlContext.implicits._                                // enables .toDF() on an RDD of case classes
val people: RDD[Person] = ...                                // an RDD of case class objects
people.toDF().write.parquet("people.parquet")                // convert to a DataFrame and write it out as Parquet
val parquetFile = sqlContext.read.parquet("people.parquet")  // read it back in as a DataFrame

Scala code to write to Redshift (taken from here):

parquetFile.write
  .format("com.databricks.spark.redshift")                  // spark-redshift connector
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")                        // target table in Redshift
  .option("tempdir", "s3n://path/for/temp/data")             // S3 staging area for the load
  .mode("error")                                             // fail if the table already exists
  .save()
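
Putting the two snippets together, an end-to-end version might look like the sketch below. The S3 paths, table names, and the Spark SQL query are placeholders for illustration; the point is that you can read the Parquet data from S3, transform it with Spark SQL, and hand the result straight to the Redshift connector without an explicit intermediate write of your own (the connector stages the data in the tempdir S3 location and loads it with a Redshift COPY under the hood).

// Read the raw Parquet data straight from S3 (bucket/path is hypothetical)
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Transform it with Spark SQL (table name and query are just examples)
raw.registerTempTable("server_data")
val transformed = sqlContext.sql(
  "SELECT host, COUNT(*) AS requests FROM server_data GROUP BY host")

// Write the transformed DataFrame directly to Redshift via the connector
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_data_summary")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()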
