How to run Apache Spark applications with a JAR dependency from AWS S3?
I have a .jar file containing useful functions for my application located in an AWS S3 bucket, and I want to use it as a dependency in Spark without having to first download it locally. Is it possible to directly reference the .jar file with the spark-submit (or pyspark) --jars option?

So far, I have tried the following:
spark-shell --packages com.amazonaws:aws-java-sdk:1.12.336,org.apache.hadoop:hadoop-aws:3.3.4 --jars s3a://bucket/path/to/jar/file.jar
The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are correctly set, since when running the same command without the --jars option, other files in the same bucket are read successfully. However, if the option is added, I get the following error:
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
... 27 more
I'm using Spark 3.3.1 pre-built for Apache Hadoop 3.3 and later.
This may be because, in client mode, Spark first distributes the JARs specified in --jars via Netty during its boot. To download a remote JAR from a third-party file system (i.e. S3), it already needs the right dependency (i.e. hadoop-aws) on the classpath, before it prepares the final classpath. But since it has not yet distributed the JARs, the classpath has not been prepared; so when Spark tries to download the JAR from S3, it fails with ClassNotFoundException, because hadoop-aws is not yet on the classpath. Doing the same from application code succeeds, since by that time the classpath has been resolved. In other words, downloading the dependency depends on a library that will only be loaded later.
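Given that explanation, one common workaround is to make the S3A connector part of Spark's own classpath before launch, so the launcher itself can resolve s3a:// URLs. This is a sketch, not from the original answer; the ivy cache paths below are assumptions based on where --packages typically stores downloaded artifacts, so adjust them to wherever the jars actually live on your machine:

```shell
# Sketch: copy hadoop-aws and the AWS SDK into Spark's jars directory,
# so S3AFileSystem is on the launcher classpath before --jars is resolved.
# The source paths are assumed (typical ivy cache layout after --packages).
cp ~/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.3.4.jar "$SPARK_HOME/jars/"
cp ~/.ivy2/jars/com.amazonaws_aws-java-sdk-1.12.336.jar "$SPARK_HOME/jars/"

# With the connector on Spark's classpath, the remote JAR reference works:
spark-shell --jars s3a://bucket/path/to/jar/file.jar
```

The key point is that the connector must be available to the spark-submit launcher process itself, not just to the application, which is why --packages alone is resolved too late for --jars globbing.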
To run Apache Spark applications with a JAR dependency from Amazon S3, you can use the --jars command-line option to specify the S3 URL of the JAR file when submitting the Spark application.

For example, if your JAR file is stored in the my-bucket S3 bucket at the jars/my-jar.jar path, you can submit the Spark application as follows:
spark-submit --jars s3a://my-bucket/jars/my-jar.jar \
--class com.example.MySparkApp \
s3a://my-bucket/my-spark-app.jar
This will download the JAR file from S3 and add it to the classpath of the Spark application.

Note that you need the s3a:// prefix in the S3 URL to use the s3a filesystem connector, which is the recommended connector for reading from and writing to S3. You may also need to configure the fs.s3a.access.key and fs.s3a.secret.key properties with your AWS access key and secret key in order to authenticate the connection to S3.
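If the credentials are not available as environment variables, one way to set those properties is to pass them as Hadoop configuration through --conf with the spark.hadoop. prefix (the key values below are placeholders, and the bucket/class names are the hypothetical ones from the example above):

```shell
# Hadoop fs.s3a.* properties can be set via Spark's spark.hadoop.* prefix.
# YOUR_ACCESS_KEY / YOUR_SECRET_KEY are placeholders, not real credentials.
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  --jars s3a://my-bucket/jars/my-jar.jar \
  --class com.example.MySparkApp \
  s3a://my-bucket/my-spark-app.jar
```

Avoid hard-coding real keys on the command line in shared environments; an IAM instance role or a credentials provider chain is generally preferable where available.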