How to use Spark packages in AWS Glue?
I'd like to use Datastax's spark-cassandra-connector in AWS Glue. If I run pyspark locally, my command looks like:
path/to/spark-3.0.1-bin-hadoop2.7/bin/spark-submit \
--conf spark.cassandra.connection.host=XXX \
--conf spark.cassandra.auth.username=XXX \
--conf spark.cassandra.auth.password=XXX \
--packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.1 \
~/my_script.py
How do I run this script in Glue?
Things I've tried
How to import Spark packages in AWS Glue? looks similar to my question. The accepted answer there talks about adding a zipped Python module as a parameter, but the spark-cassandra-connector isn't a Python module.
(According to @alex's comment) I put the SCC assembly in the Glue job's Jar lib path.
Error:
File "/tmp/delta_on_s3_spark.py", line 75, in _write_df_to_cassandra
df.write.format(format_).mode('append').options(table=table, keyspace=keyspace).save()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
self._jwrite.save()
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o84.save.
: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at com.datastax.spark.connector.TableRef.<init>(TableRef.scala:4)
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:83)
......
(Also according to @alex's comment) I put spark.jars.packages = com.datastax.spark:spark-cassandra-connector_2.12:2.5.1 in the Glue job's job parameters.
Error:
File "/tmp/delta_on_s3_spark.py", line 75, in _write_df_to_cassandra
df.write.format(format_).mode('append').options(table=table, keyspace=keyspace).save()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
self._jwrite.save()
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
Answer:

The recommended way is to use --packages or --conf spark.jars.packages with Maven coordinates, so that Spark correctly pulls in all the dependencies the Spark Cassandra Connector needs (the Java driver, etc.). If you use --jars with only the SCC jar, your job will fail.
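For local testing outside of Glue, here is a minimal PySpark sketch of the same idea; the host and credentials are placeholders, and the coordinates are the ones from the question. Setting spark.jars.packages before the session (and its JVM) starts makes Spark resolve SCC and its transitive dependencies from Maven, which --jars with a single jar does not do.

from pyspark.sql import SparkSession

# Minimal local sketch (not Glue): spark.jars.packages must be set before
# the underlying JVM starts, i.e. when the session is first created.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:2.5.1")
    .config("spark.cassandra.connection.host", "XXX")  # placeholder
    .config("spark.cassandra.auth.username", "XXX")    # placeholder
    .config("spark.cassandra.auth.password", "XXX")    # placeholder
    .getOrCreate()
)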
Starting with SCC 2.5.1 there is also a new artifact, spark-cassandra-connector-assembly, that includes all necessary dependencies. With it you can avoid problems with conflicting dependencies, and you can use it with --jars or with the Glue job's Jar lib path.
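To make the Glue route concrete, here is a hedged boto3 sketch: upload the assembly jar to S3 and point the job's Jar lib path at it (in the Glue API this is the --extra-jars default argument). The job name, role, bucket, and paths below are all hypothetical.

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; the --extra-jars argument is the point here.
glue.create_job(
    Name="my-cassandra-job",                           # hypothetical
    Role="arn:aws:iam::123456789012:role/MyGlueRole",  # hypothetical
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_script.py",  # hypothetical
        "PythonVersion": "3",
    },
    DefaultArguments={
        # the console's "Jar lib path" maps to --extra-jars in the API:
        "--extra-jars": "s3://my-bucket/jars/spark-cassandra-connector-assembly_2.12-2.5.1.jar",
    },
)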
P.S. With Spark 3.0 it's recommended to use SCC 3.0.0-beta, because of significant changes in the internals of Spark SQL.
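Once the connector is on the classpath by whichever route, a quick smoke test inside the script (assuming a SparkSession named spark, with hypothetical keyspace/table names) is to go through the SCC data source; if the jar was picked up, the format lookup succeeds instead of raising the ClassNotFoundException shown above.

# Hypothetical smoke test; table and keyspace names are placeholders.
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(table="my_table", keyspace="my_keyspace")
    .load()
)
df.show(5)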