
How to run Spark structured streaming using local JAR files

I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work with Kafka using Spark Structured Streaming (pyspark). As per the integration guide, I run a Python script as follows.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  <myscript>.py
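
For context, the content of <myscript>.py is not shown in the question. A minimal sketch of the kind of Kafka structured streaming script such a command would run might look like the following; the bootstrap server address and topic name are hypothetical placeholders.

from pyspark.sql import SparkSession

# Reuse the Spark session created by spark-submit.
spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic; the broker address and topic name are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load())

# Cast the binary key/value columns to strings and write each micro-batch to the console.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()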

It downloads the packages from Maven Central, and I see some JAR files are downloaded into ~/.ivy2/jars.

com.github.luben_zstd-jni-1.4.8-1.jar       org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.2.jar             org.slf4j_slf4j-api-1.7.30.jar
org.apache.commons_commons-pool2-2.6.2.jar  org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.2.jar  org.spark-project.spark_unused-1.0.0.jar
org.apache.kafka_kafka-clients-2.6.0.jar    org.lz4_lz4-java-1.7.1.jar                                       org.xerial.snappy_snappy-java-1.1.8.2.jar

However, I find the main JAR file is already shipped in $SPARK_HOME/external/lib, and I wonder how to make use of it instead of downloading it.

spark-avro_2.12-3.1.2-amzn-1.jar          spark-ganglia-lgpl.jar                      spark-streaming-kafka-0-10-assembly_2.12-3.1.2-amzn-1.jar   spark-streaming-kinesis-asl-assembly.jar
spark-avro.jar                            **spark-sql-kafka-0-10_2.12-3.1.2-amzn-1.jar  spark-streaming-kafka-0-10-assembly.jar                     spark-token-provider-kafka-0-10_2.12-3.1.2-amzn-1.jar
spark-ganglia-lgpl_2.12-3.1.2-amzn-1.jar  **spark-sql-kafka-0-10.jar                    spark-streaming-kinesis-asl-assembly_2.12-3.1.2-amzn-1.jar  spark-token-provider-kafka-0-10.jar

UPDATE 2022-03-09

I tried again after updating spark-defaults.conf as shown below, adding the external lib folder to the class paths.

spark.driver.extraClassPath      /usr/lib/spark/external/lib/*:...
spark.driver.extraLibraryPath    ...
spark.executor.extraClassPath    /usr/lib/spark/external/lib/*:...
spark.executor.extraLibraryPath  ...

I can now run without --packages, but it fails with the following error.

22/03/09 05:37:25 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:623)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:52)
        at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:40)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:60)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 33 more

It doesn't help even though I tried adding --packages org.apache.commons:commons-pool2:2.6.2.

You would use --jars to refer to JAR files on the local filesystem in place of --packages.
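
A minimal sketch of that approach, assuming the connector JARs in $SPARK_HOME/external/lib are referenced directly and that any transitive dependencies they need but that are not in that folder (for example kafka-clients and commons-pool2) are supplied as well, since --jars does not resolve transitive dependencies the way --packages does. The kafka-clients and commons-pool2 paths below are placeholders.

# --jars takes a comma-separated list of local JAR paths
$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --jars "$SPARK_HOME/external/lib/spark-sql-kafka-0-10.jar,$SPARK_HOME/external/lib/spark-token-provider-kafka-0-10.jar,/path/to/kafka-clients-2.6.0.jar,/path/to/commons-pool2-2.6.2.jar" \
  <myscript>.py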

Unfortunately I cannot submit an app with only the JAR files in $SPARK_HOME/external/lib due to an error. The details of the error have been added to the question above. Instead I ended up pre-downloading the package JAR files and using those.

I first ran the following command. Here foo.py is an empty file; the run simply downloads the package JAR files into /home/hadoop/.ivy2/jars.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  foo.py

Then I updated spark-defaults.conf as follows.

spark.driver.extraClassPath      /home/hadoop/.ivy2/jars/*:...
spark.driver.extraLibraryPath    ...
spark.executor.extraClassPath    /home/hadoop/.ivy2/jars/*:...
spark.executor.extraLibraryPath  ...

After that, I ran the submit command without --packages, and it worked without an error.

$SPARK_HOME/bin/spark-submit \
  --deploy-mode client \
  --master local \
  <myscript>.py

This approach is likely to be useful when downloading the package JAR files takes a long time, since they can be pre-downloaded. Note that EMR on EKS supports using a custom image from ECR, so the pre-downloaded JAR files can be baked into the image.
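
For example, a minimal Dockerfile sketch that bakes the pre-downloaded JARs into a custom image could look like the following; the base image URI, the local jars/ and conf/ build-context directories, and the spark-defaults.conf path are assumptions to adjust for your region and setup.

# Base image URI is an assumption; use the EMR on EKS base image for your region.
FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.5.0:latest

USER root

# Copy the pre-downloaded connector JARs into the image.
COPY jars/ /home/hadoop/.ivy2/jars/

# Copy a spark-defaults.conf updated as shown above
# (extraClassPath pointing at /home/hadoop/.ivy2/jars/*).
COPY conf/spark-defaults.conf /usr/lib/spark/conf/spark-defaults.conf

# EMR on EKS expects the hadoop user at runtime.
USER hadoop:hadoop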
