Error while Connecting PySpark to AWS Redshift

I have been trying to connect Spark 2.2.1 on my EMR 5.11.0 cluster to our Redshift store.

The approach I followed was:

  1. Use the built-in Redshift JDBC driver:

     pyspark --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar

     from pyspark.sql import SQLContext

     sc  # SparkContext, predefined by the pyspark shell
     sql_context = SQLContext(sc)

     redshift_url = "jdbc:redshift://HOST:PORT/DATABASE?user=USER&password=PASSWORD"
     redshift_query = "select * from table;"
     redshift_query_tempdir_storage = "s3://personal_warehouse/wip_dumps/"

     # Read data from a query
     df_users = sql_context.read \
         .format("com.databricks.spark.redshift") \
         .option("url", redshift_url) \
         .option("query", redshift_query) \
         .option("tempdir", redshift_query_tempdir_storage) \
         .option("forward_spark_s3_credentials", "true") \
         .load()

    This gives me the following error -

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 165, in load
    return self._df(self._jreader.load())
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:546)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:302)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:530)
    ... 16 more

Can someone please point out where I've missed something or made a silly mistake?

Thanks!

You need to add the Spark Redshift data source to your pyspark command:

pyspark --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar \
        --packages com.databricks:spark-redshift_2.11:2.0.1
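
With the jar and the package on the classpath, the read from the question should then work unchanged inside that pyspark shell. A minimal sketch, reusing the placeholders (HOST, PORT, DATABASE, USER, PASSWORD and the S3 tempdir) from the question:

from pyspark.sql import SQLContext

sql_context = SQLContext(sc)  # sc is predefined by the pyspark shell

df_users = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://HOST:PORT/DATABASE?user=USER&password=PASSWORD") \
    .option("query", "select * from table;") \
    .option("tempdir", "s3://personal_warehouse/wip_dumps/") \
    .option("forward_spark_s3_credentials", "true") \
    .load()

df_users.show(5)  # quick sanity check that the data source resolves and the query runs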

I had to include 4 jar files in the EMR spark-submit options to get this working.

List of jar files:

1. RedshiftJDBC41-1.2.12.1017.jar

2. spark-redshift_2.10-2.0.0.jar

3. minimal-json-0.9.4.jar

4. spark-avro_2.11-3.0.0.jar

You can download the jar files, store them in an S3 bucket, and point to them in the spark-submit options like this:

--jars s3://<pathToJarFile>/RedshiftJDBC41-1.2.10.1009.jar,s3://<pathToJarFile>/minimal-json-0.9.4.jar,s3://<pathToJarFile>/spark-avro_2.11-3.0.0.jar,s3://<pathToJarFile>/spark-redshift_2.10-2.0.0.jar
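
For instance, a complete spark-submit invocation would look roughly like the sketch below. The <pathToJarFile> placeholders are the ones from this answer, and my_redshift_job.py is a hypothetical name for your PySpark script:

spark-submit \
    --jars s3://<pathToJarFile>/RedshiftJDBC41-1.2.10.1009.jar,s3://<pathToJarFile>/minimal-json-0.9.4.jar,s3://<pathToJarFile>/spark-avro_2.11-3.0.0.jar,s3://<pathToJarFile>/spark-redshift_2.10-2.0.0.jar \
    my_redshift_job.py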

And then finally query your Redshift in your Spark code as in this example: spark-redshift-example.

The problem is that Spark cannot find the necessary packages at the moment of execution. In the .sh script that launches the Python file, you have to add not only the JDBC driver but also the necessary package.

script test.sh

sudo pip install boto3

spark-submit --jars RedshiftJDBC42-1.2.15.1025.jar --packages com.databricks:spark-redshift_2.11:2.0.1 test.py

script test.py

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()  # sc is not predefined under spark-submit, so create it here
sql_context = SQLContext(sc)

redshift_url = "jdbc:redshift://HOST:PORT/DATABASE?user=USER&password=PASSWORD"
redshift_query = "select * from table;"
redshift_query_tempdir_storage = "s3://personal_warehouse/wip_dumps/"

# Read data from a query
df_users = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("query", redshift_query) \
    .option("tempdir", redshift_query_tempdir_storage) \
    .option("forward_spark_s3_credentials", "true") \
    .load()

Run the script test.sh

sudo sh test.sh

The problem should be solved now.
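
As a side note, on an EMR cluster the Redshift JDBC driver is usually already available under /usr/share/aws/redshift/jdbc/ (the path used in the question), so the spark-submit line in test.sh could point --jars at the bundled copy instead of a separately downloaded jar. A sketch under that assumption:

spark-submit \
    --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar \
    --packages com.databricks:spark-redshift_2.11:2.0.1 \
    test.py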
