Best way to customize JARs in spark worker classpath
I am working on an ETL pipeline in Spark and I find that pushing a release is time/bandwidth intensive. My release script (pseudocode):
sbt assembly
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar
spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--verbose \
--conf "spark.executor.memory=16g" \
"$JAR_URL"
which works, but assembly can take over 4 minutes and the push another minute. My build.sbt:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
"io.spray" %% "spray-json" % "1.3.3",
// "commons-net" % "commons-net" % "3.5",
// "org.apache.httpcomponents" % "httpclient" % "4.5.2",
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
The issue appears to be the sheer size of elasticsearch-spark-20_2.11. It adds about 90MB to my uberjar. I would be happy to turn it into a provided dependency on the Spark host, making it unnecessary to package. The question is, what's the best way to do that? Should I just manually copy over jars, or is there a foolproof way of specifying a dependency and having a tool resolve all the transitive dependencies?
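(For the transitive-resolution part, one tool-based approach would be the coursier CLI, which resolves an artifact plus its full transitive closure and prints the path of every resolved jar. This is an untested sketch, not something wired into my pipeline; the `cs` binary and `SPARK_HOME` are assumptions:)

```shell
# Sketch (assumes the coursier CLI `cs` is installed and SPARK_HOME is set):
# resolve elasticsearch-spark and all of its transitive dependencies,
# then copy every resolved jar into the Spark installation's jars/ directory.
cs fetch org.elasticsearch:elasticsearch-spark-20_2.11:5.3.1 |
  xargs -I{} cp {} "$SPARK_HOME/jars/"
```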
I have my spark jobs running, and much more quickly now. I ran
sbt assemblyPackageDependency
which generated a huge jar (110MB!) that is easily placed in the 'jars' folder of the Spark working directory, so now my Dockerfile for a Spark cluster looks like this:
FROM openjdk:8-jre
ENV SPARK_VERSION 2.1.0
ENV HADOOP_VERSION hadoop2.7
ENV SPARK_MASTER_OPTS="-Djava.net.preferIPv4Stack=true"
RUN apt-get update && apt-get install -y python
RUN curl -sSLO http://mirrors.ocf.berkeley.edu/apache/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz && tar xzfC /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz /usr/share && rm /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz
# master or worker's web UI port
EXPOSE 8080
# master's rest api port
EXPOSE 7077
ADD deps.jar /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION/jars/
WORKDIR /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION
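(Note that sbt-assembly's assemblyPackageDependency task names its output `<name>-assembly-<version>-deps.jar` by default, so the jar has to be renamed to match the `deps.jar` the Dockerfile ADDs. If I understand the plugin's settings correctly, the name can instead be pinned in build.sbt:)

```scala
// Pin the output of `sbt assemblyPackageDependency` to the name the
// Dockerfile expects, instead of the default <name>-assembly-<version>-deps.jar.
assemblyJarName in assemblyPackageDependency := "deps.jar"
```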
After deploying that configuration, I changed my build.sbt so the kafka-streaming / elasticsearch-spark jars and their dependencies are marked as provided:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0" % "provided",
"io.spray" %% "spray-json" % "1.3.3" % "provided",
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Now my deploys go through in 20 seconds!
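(For what it's worth, an alternative to baking deps.jar into the image would be to host it next to the application jar and hand it to spark-submit's --jars flag, which distributes the listed jars to executors. A sketch; `$DEPS_JAR_URL` is hypothetical:)

```shell
# Sketch: ship the dependency jar per-application instead of per-image.
# --jars takes a comma-separated list; $DEPS_JAR_URL is an assumption.
spark-submit \
  --class comapplications.WindowsETLElastic \
  --master spark://spark-submit.cloud \
  --deploy-mode cluster \
  --conf "spark.executor.memory=16g" \
  --jars "$DEPS_JAR_URL" \
  "$JAR_URL"
```

The trade-off is that workers fetch deps.jar once per application rather than having it pre-installed, so this avoids image rebuilds at the cost of some startup transfer.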