Best way to customize JARs in spark worker classpath
I'm working on an ETL pipeline in Spark, and I've found that pushing out a release takes a lot of time/bandwidth. My release script (pseudocode):
sbt assembly
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar
spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--verbose \
--conf "spark.executor.memory=16g" \
"$JAR_URL"
It works, but the assembly can take more than 4 minutes, and the upload another minute to complete. My build.sbt:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
"io.spray" %% "spray-json" % "1.3.3",
// "commons-net" % "commons-net" % "3.5",
// "org.apache.httpcomponents" % "httpclient" % "4.5.2",
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
The problem seems to be the sheer size of elasticsearch-spark-20_2.11: it adds about 90 MB to my uberjar. I would happily turn it into a provided dependency supplied by the Spark hosts, so it would not need to be packaged at all. The question is, what is the best way to do that? Should I just copy the jars over by hand, or is there a foolproof way to specify the dependency and let the tooling resolve all of its transitive dependencies?
My Spark job is working, and deploys are much faster now. I ran
sbt assemblyPackageDependency
which generated a huge jar (110 MB!) that can simply be dropped into the jars folder of the Spark working directory. The Dockerfile for my Spark cluster now looks like this:
FROM openjdk:8-jre
ENV SPARK_VERSION 2.1.0
ENV HADOOP_VERSION hadoop2.7
ENV SPARK_MASTER_OPTS="-Djava.net.preferIPv4Stack=true"
RUN apt-get update && apt-get install -y python
RUN curl -sSLO http://mirrors.ocf.berkeley.edu/apache/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz && tar xzfC /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz /usr/share && rm /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz
# master or worker's web UI port
EXPOSE 8080
# master's RPC port (used by spark:// submissions)
EXPOSE 7077
ADD deps.jar /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION/jars/
WORKDIR /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION
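The assemblyPackageDependency task used above comes from the sbt-assembly plugin. If it is not already wired up, a minimal project/plugins.sbt can be created from the shell as sketched below; the plugin version (0.14.5) is an assumption, so pick one that matches your sbt release.

```shell
# Sketch: register sbt-assembly, which provides the assemblyPackageDependency task.
# The plugin version (0.14.5) is an assumption; use one compatible with your sbt.
mkdir -p project
cat > project/plugins.sbt <<'EOF'
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
EOF
```

After this, `sbt assemblyPackageDependency` builds a jar containing only the (non-provided) dependencies, which is what gets baked into the Docker image as deps.jar.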
After deploying that configuration, I changed my build.sbt so that the kafka-streaming and elasticsearch-spark jars and their dependencies are marked as provided:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq (
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0" % "provided",
"io.spray" %% "spray-json" % "1.3.3" % "provided",
"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Now my deploys finish in 20 seconds!