Read from AWS Redshift using Databricks (and Apache Spark)

I'm trying to execute a SQL SELECT command against AWS Redshift using Databricks.

I went through the https://github.com/databricks/spark-redshift README and configured:

  • Spark driver to Redshift - I'm passing the user and password options
  • Spark to S3 - I've mounted AWS S3 using a dbfs mount (a sketch of this step follows the list)
  • Redshift to S3 - I'm passing temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token
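
For context, a minimal sketch of the dbfs mount step mentioned above. The bucket name is a placeholder and the mount point mirrors the tempdir used below; the call assumes the cluster already has S3 access (e.g., via an instance profile), otherwise keys would have to be supplied through extra_configs:

%python

# Mount the S3 bucket into DBFS (placeholder bucket, mount point matching the tempdir below)
dbutils.fs.mount(
  source = "s3a://<MY_S3_BUCKET>",
  mount_point = "/mnt/result_bucket"
)

# Quick check that the mount is readable
display(dbutils.fs.ls("/mnt/result_bucket"))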

NOTE: This is a kind of proof of concept, so I'm ignoring all security details such as encryption.

Below is the configuration I used in my Databricks notebook:

%python

# Read data from a table
df = spark.read \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:postgresql://<REDSHIFT_URL>:<DB_PORT>/<DB_NAME>") \
  .option("temporary_aws_access_key_id", "XXX") \
  .option("temporary_aws_secret_access_key","XXX") \
  .option("temporary_aws_session_token", "XXX") \
  .option("user", "XXX") \
  .option("password", "XXX") \
  .option("tempdir", "dbfs:/mnt/result_bucket/...") \
  .option("query", "SELECT * FROM users") \
  .load()

# display(df)  # raises the SQLException shown below

Result:

(screenshot of the resulting DataFrame)

But when I uncomment the last line and try to see the SQL SELECT results:

java.sql.SQLException: Exception thrown in awaitResult: 
    at com.databricks.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:223)
    at com.databricks.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:197)
    at com.databricks.spark.redshift.RedshiftRelation.$anonfun$getRDDFromS3$1(RedshiftRelation.scala:212)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:377)
    at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:363)
    at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
    at com.databricks.spark.redshift.RedshiftRelation.getRDDFromS3(RedshiftRelation.scala:212)
    at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:157)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$3(DataSourceStrategy.scala:426)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:460)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:538)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:459)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:426)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:69)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:69)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:100)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:75)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$4(QueryPlanner.scala:85)
    at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
    at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:162)
    at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:160)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1429)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:100)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:75)
    at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:493)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:129)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:180)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:180)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:129)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:141)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:141)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:136)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$simpleString$2(QueryExecution.scala:199)
    at org.apache.spark.sql.execution.ExplainUtils$.processPlan(ExplainUtils.scala:115)
    at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:199)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:260)
    at org.apache.spark.sql.execution.QueryExecution.explainStringLocal(QueryExecution.scala:226)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:123)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:223)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3823)
    at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3031)
    at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:268)
    at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:102)
    at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:526)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.computeListResultsItem(PythonDriverLocal.scala:672)
    at com.databricks.backend.daemon.driver.PythonDriverLocalBase.genListResults(PythonDriverLocalBase.scala:490)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.$anonfun$getResultBufferInternal$1(PythonDriverLocal.scala:727)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:608)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.getResultBufferInternal(PythonDriverLocal.scala:687)
    at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:634)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.outputSuccess(PythonDriverLocal.scala:650)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.$anonfun$repl$6(PythonDriverLocal.scala:221)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:608)
    at com.databricks.backend.daemon.driver.PythonDriverLocal.repl(PythonDriverLocal.scala:208)
    at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:526)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
    at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
    at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:503)
    at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
    at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: ERROR: UNLOAD destination is not supported. (Hint: only S3 based unload is allowed)
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2477)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2190)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
    at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:158)
    at com.databricks.spark.redshift.JDBCWrapper.$anonfun$executeInterruptibly$1(RedshiftJDBCWrapper.scala:197)
    at com.databricks.spark.redshift.JDBCWrapper.$anonfun$executeInterruptibly$1$adapted(RedshiftJDBCWrapper.scala:197)
    at com.databricks.spark.redshift.JDBCWrapper.$anonfun$executeInterruptibly$2(RedshiftJDBCWrapper.scala:215)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

More details:

Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

I've tried the same with the JDBC Redshift driver (using the URL prefix jdbc:redshift); for that I had to install com.github.databricks:spark-redshift_2.11:master-SNAPSHOT in my Databricks cluster libraries. The result was the same.

Data inside Redshift (sample data created by AWS): (screenshot of the sample Redshift data)

Does anyone have an idea what is wrong with my configuration?

After several tries, I figured out the solution.

  • I deleted the temporary keys
  • I used forward_spark_s3_credentials
  • I attached an IAM role to the cluster's EC2 instances
  • I used an s3a path instead of the mounted dbfs directory - Redshift's UNLOAD can only write directly to S3, which matches the "only S3 based unload is allowed" hint in the exception (a quick sanity check follows the list)
  • I updated the cluster's libraries:
    • I used RedshiftJDBC42_no_awssdk_1_2_55_1083.jar
    • and deleted com.github.databricks:spark-redshift_2.11:master-SNAPSHOT
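
Not part of the original answer, but a minimal sanity check along these lines can confirm that the cluster's instance-profile credentials (the ones forward_spark_s3_credentials hands to Redshift) can actually reach the temp bucket; the bucket name is the same placeholder as in the configuration below:

%python

# List the temp bucket over s3a using the cluster's instance-profile credentials
display(dbutils.fs.ls("s3a://<MY_S3_BUCKET>/"))

# Optionally write a small object to confirm write access as well
dbutils.fs.put("s3a://<MY_S3_BUCKET>/connectivity_check.txt", "ok", overwrite=True)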

Final configuration:

df = spark.read \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://<REDSHIFT_URL>:<DB_PORT>/<DB_NAME>") \
  .option("forward_spark_s3_credentials", "true") \
  .option("user", "XXX") \
  .option("password", "XXX") \
  .option("tempdir", "s3a://<MY_S3_BUCKET>/...") \
  .option("query", "SELECT userid, username FROM users") \
  .load()

display(df)
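
As a small usage example beyond the original answer, the loaded DataFrame can also be registered as a temporary view and queried with Spark SQL from the notebook:

%python

# Register the Redshift result as a temp view and query it with Spark SQL
df.createOrReplaceTempView("redshift_users")
spark.sql("SELECT username FROM redshift_users LIMIT 10").show()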

Libraries set up in the cluster (probably only the Redshift JDBC driver is needed; I also added libraries from the AWS bundle, which can be found here):

(screenshot of the Databricks cluster library configuration)

The final code will be on my GitHub.

One of the errors could be in option('url'): it should be jdbc:redshift, not jdbc:postgresql. Use redshift and retry.
