
Can you copy straight from Parquet/S3 to Redshift using Spark SQL/Hive/Presto?

We have huge amounts of server data stored in S3 (soon to be in Parquet format). The data needs some transformation, so it can't be a straight copy from S3. I'll be using Spark to access the data, but I'm wondering whether, instead of manipulating it with Spark, writing back out to S3, and then copying to Redshift, I can skip a step and run a query that pulls/transforms the data and then copies it straight to Redshift?

Sure thing, totally possible.

Scala code to read Parquet (taken from here):

import sqlContext.implicits._                                // enables .toDF() on an RDD of case classes
val people: RDD[Person] = ...                                // an RDD of case class objects
people.toDF().write.parquet("people.parquet")                // convert to a DataFrame and write it out as Parquet
val parquetFile = sqlContext.read.parquet("people.parquet")  // read it back in as a DataFrame

Scala code to write to Redshift (taken from here):

parquetFile.write
  .format("com.databricks.spark.redshift")                  // spark-redshift connector
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")                        // target table in Redshift
  .option("tempdir", "s3n://path/for/temp/data")             // S3 staging area for the load
  .mode("error")                                             // fail if the table already exists
  .save()
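
Putting the two snippets together, an end-to-end version might look like the sketch below. The S3 paths, table names, and the Spark SQL query are placeholders for illustration; the point is that you can read the Parquet data from S3, transform it with Spark SQL, and hand the result straight to the Redshift connector without an explicit intermediate write of your own (the connector stages the data in the tempdir S3 location and loads it with a Redshift COPY under the hood).

// Read the raw Parquet data straight from S3 (bucket/path is hypothetical)
val raw = sqlContext.read.parquet("s3n://my-bucket/server-data/")

// Transform it with Spark SQL (table name and query are just examples)
raw.registerTempTable("server_data")
val transformed = sqlContext.sql(
  "SELECT host, COUNT(*) AS requests FROM server_data GROUP BY host")

// Write the transformed DataFrame directly to Redshift via the connector
transformed.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "server_data_summary")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()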
