How to run Apache Spark applications with a JAR dependency from AWS S3?

I have a .jar file containing useful functions for my application located in an AWS S3 bucket, and I want to use it as a dependency in Spark without having to first download it locally. Is it possible to directly reference the .jar file with the spark-submit (or pyspark) --jars option?

So far, I have tried the following:

spark-shell --packages com.amazonaws:aws-java-sdk:1.12.336,org.apache.hadoop:hadoop-aws:3.3.4 --jars s3a://bucket/path/to/jar/file.jar

The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are correctly set, since when running the same command without the --jars option, other files in the same bucket are successfully read. However, if the option is added, I get the following error:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
    at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
    at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:364)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
    ... 27 more

I'm using Spark 3.3.1 pre-built for Apache Hadoop 3.3 and later.

This may be because, in client mode, Spark first distributes the JARs specified in --jars via Netty during startup. To download a remote JAR from a third-party file system (i.e. S3), it already needs the right dependency (i.e. hadoop-aws) on the classpath, before it has prepared the final classpath.

But since it has not yet distributed the JARs, it has not prepared the classpath - so when it tries to download the JAR from S3, it fails with ClassNotFoundException, as hadoop-aws is not yet on the classpath. Doing the same in the application code succeeds, because by then the classpath has been resolved.

i.e. downloading the dependency depends on a library that will only be loaded later.
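
One possible workaround, sketched below under the assumption that the version numbers match your Spark/Hadoop build, is to make hadoop-aws and a matching AWS SDK bundle visible to spark-submit itself (for example by copying them into $SPARK_HOME/jars), so that S3AFileSystem can already be loaded while the --jars URLs are being resolved:

# Versions here are assumptions - pick the hadoop-aws version matching the
# Hadoop build of your Spark distribution and the SDK bundle it pairs with.
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
cp hadoop-aws-3.3.4.jar aws-java-sdk-bundle-1.12.262.jar "$SPARK_HOME/jars/"

# With the connector on Spark's own classpath, the s3a:// path in --jars
# can be resolved at submit time.
spark-shell --jars s3a://bucket/path/to/jar/file.jar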

To run Apache Spark applications with a JAR dependency from Amazon S3, you can use the --jars command-line option to specify the S3 URL of the JAR file when submitting the Spark application.

For example, if your JAR file is stored in the my-bucket S3 bucket at the jars/my-jar.jar path, you can submit the Spark application as follows:

spark-submit --jars s3a://my-bucket/jars/my-jar.jar \
  --class com.example.MySparkApp \
  s3a://my-bucket/my-spark-app.jar

This will download the JAR file from S3 and add it to the classpath of the Spark application.

Note that you will need to include the s3a:// prefix in the S3 URL to use the s3a filesystem connector, which is the recommended connector for reading from and writing to S3. You may also need to configure the fs.s3a.access.key and fs.s3a.secret.key properties with your AWS access key and secret key in order to authenticate the connection to S3.
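
For example, the credentials can be passed as Hadoop properties on the command line (a sketch with placeholder keys; Hadoop properties set through spark-submit need the spark.hadoop. prefix):

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  --jars s3a://my-bucket/jars/my-jar.jar \
  --class com.example.MySparkApp \
  s3a://my-bucket/my-spark-app.jar

In practice, environment variables, credential providers, or IAM roles are usually preferable to plain-text keys on the command line.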
