yarn in docker - __spark_libs__.zip does not exist

I have looked through this StackOverflow post, but it hasn't helped me much.

I am trying to get YARN working on an existing cluster. So far we have been using the Spark standalone manager as our resource allocator, and it has been working as expected.

This is a basic overview of our architecture. Everything in the white boxes runs in Docker containers. (architecture diagram)

From master-machine I can run the following command from within the YARN resource manager container and get a spark-shell running that uses YARN:

./pyspark --master yarn --driver-memory 1G --executor-memory 1G --executor-cores 1 --conf "spark.yarn.am.memory=1G"

However, if I try to run the same command from client-machine within the jupyter container, I get the following error in the YARN UI:

Application application_1512999329660_0001 failed 2 times due to AM 
Container for appattempt_1512999329660_0001_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://master-machine:5000/proxy/application_1512999329660_0001/Then, click on links to logs of each attempt.
Diagnostics: File file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/__spark_libs__5915104925224729874.zip does not exist
java.io.FileNotFoundException: File file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/__spark_libs__5915104925224729874.zip does not exist

I can find file:/sparktmp/spark-58732bb2-f513-4aff-b1f0-27f0a8d79947/ on the client-machine, but I am unable to find spark-58732bb2-f513-4aff-b1f0-27f0a8d79947 on the master-machine.

As a note, spark-shell works from the client-machine when it points to the standalone Spark manager on the master-machine.

No logs are printed to the YARN log directories on the worker-machines either.
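For completeness, this is roughly how I would pull any aggregated container logs for the failed attempt with the yarn logs CLI (application id taken from the error above; this only returns output if log aggregation is enabled on the cluster):

yarn logs -applicationId application_1512999329660_0001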

If I run a spark-submit on spark/examples/src/main/python/pi.py, I get the same error as above.
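For reference, a sketch of the kind of spark-submit invocation used here (the resource flags below are illustrative, not the exact values from my setup):

$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1G \
  --executor-memory 1G \
  $SPARK_HOME/examples/src/main/python/pi.py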

Here is the yarn-site.xml:

<configuration>
  <property>
    <description>YARN hostname</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-machine</value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
    <!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
  </property>

  <property>
    <description>The address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:5000</value>
  </property>

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8031</value>
  </property>

  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>

  <property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
  </property>

  <property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:8033</value>
  </property>

  <property>
    <description>Set to false, to avoid ip check</description>
    <name>hadoop.security.token.service.use_ip</name>
    <value>false</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>1000</value>
    <description>Maximum number of applications in the system which
      can be concurrently active both running and pending</description>
  </property>

  <property>
    <description>Whether to use preemption. Note that preemption is experimental
      in the current version. Defaults to false.</description>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>

  <property>
    <description>Whether to allow multiple container assignments in one
      heartbeat. Defaults to false.</description>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>

  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

</configuration>

And here is the spark.conf:

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# DRIVER PROPERTIES
spark.driver.port 7011
spark.fileserver.port 7021
spark.broadcast.port 7031
spark.replClassServer.port 7041
spark.akka.threads 6
spark.driver.cores 4
spark.driver.memory 32g
spark.master yarn
spark.deploy.mode client

# DRIVER AND EXECUTORS
spark.blockManager.port 7051

# EXECUTORS
spark.executor.port 7101

# GENERAL
spark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory
spark.port.maxRetries 10
spark.local.dir /sparktmp
spark.scheduler.mode  FAIR

# SPARK UI
spark.ui.port 4140

# DYNAMIC ALLOCATION AND SHUFFLE SERVICE
# http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
spark.shuffle.service.port 7061
spark.dynamicAllocation.initialExecutors 5
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 8
spark.dynamicAllocation.executorIdleTimeout 60s

# LOGGING
spark.executor.logs.rolling.maxRetainedFiles 5
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 100000000

# JMX
# Testing
# spark.driver.extraJavaOptions -Dcom.sun.management.jmxremote.port=8897 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

# Spark Yarn Configs
spark.hadoop.yarn.resourcemanager.address <master-machine IP>:8032
spark.hadoop.yarn.resourcemanager.hostname master-machine

And this shell script is run on all the machines:

# The main ones
export CONDA_DIR=/cluster/conda
export HADOOP_HOME=/usr/hadoop
export SPARK_HOME=/usr/spark
export JAVA_HOME=/usr/java/latest

export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$CONDA_DIR/bin:/cluster/libs-python:/cluster/batch
export PYTHONPATH=/cluster/libs-python:$SPARK_HOME/python:$PY4JPATH:$PYTHONPATH
export SPARK_CLASSPATH=/cluster/libs-java/*:/cluster/libs-python:$SPARK_CLASSPATH

# Core spark configuration
export PYSPARK_PYTHON="/cluster/conda/bin/python"
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_WEBUI_PORT=7081
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Duser.timezone=UTC+02:00"
export SPARK_WORKER_DIR="/sparktmp"
export SPARK_WORKER_CORES=22
export SPARK_WORKER_MEMORY=43G
export SPARK_DAEMON_MEMORY=1G
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_MEMORY=4G
export SPARK_EXECUTOR_CORES=2
export SPARK_LOCAL_IP=$(hostname -I | cut -f1 -d " ")
export SPARK_PUBLIC_DNS=$(hostname -I | cut -f1 -d " ")
export SPARK_MASTER_OPTS="-Duser.timezone=UTC+02:00"

This is the hdfs-site.xml on the master-machine (namenode):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hdfs</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/hdfs/name</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
   <property>
      <name>dfs.replication.max</name>
      <value>3</value>
   </property>
   <property>
      <name>dfs.replication.min</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permissions.superusergroup</name>
      <value>supergroup</value>
   </property>

   <property>
     <name>dfs.blocksize</name>
     <value>268435456</value>
   </property>

   <property>
     <name>dfs.permissions.enabled</name>
     <value>true</value>
   </property>

   <property>
     <name>fs.permissions.umask-mode</name>
     <value>002</value>
   </property>

  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>

  <property>
  <!-- 1000Mbit/s -->
    <name>dfs.balance.bandwidthPerSec</name>
    <value>125000000</value>
  </property>

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/cluster/config/hadoopconf/namenode/dfs.hosts.exclude</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    <value>10</value>
  </property>

  <property>
    <name>dfs.namenode.replication.max-streams</name>
    <value>50</value>
  </property>

  <property>
    <name>dfs.namenode.replication.max-streams-hard-limit</name>
    <value>100</value>
  </property>

</configuration>

And this is the hdfs-site.xml on the worker-machines (datanodes):

<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hdfs,/hdfs2,/hdfs3</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/hdfs/name</value>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
   <property>
      <name>dfs.replication.max</name>
      <value>3</value>
   </property>
   <property>
      <name>dfs.replication.min</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permissions.superusergroup</name>
      <value>supergroup</value>
   </property>

   <property>
     <name>dfs.blocksize</name>
     <value>268435456</value>
   </property>

   <property>
     <name>dfs.permissions.enabled</name>
     <value>true</value>
   </property>

   <property>
     <name>fs.permissions.umask-mode</name>
     <value>002</value>
   </property>

   <property>
   <!-- 1000Mbit/s -->
     <name>dfs.balance.bandwidthPerSec</name>
     <value>125000000</value>
   </property>
</configuration>

This is the core-site.xml on the worker-machines (datanodes):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-machine:54310/</value>
  </property>
</configuration>

This is the core-site.xml on the master-machine (namenode):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-machine:54310/</value>
  </property>
</configuration>

After a lot of debugging I was able to identify that, for some reason, the jupyter container was not looking in the correct Hadoop conf directory, even though the HADOOP_HOME environment variable was pointing to the correct location. All I had to do to resolve the above problem was point HADOOP_CONF_DIR at the correct directory, and everything started working again.
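For anyone who hits the same thing, a minimal sketch of the fix, assuming the Hadoop config files (core-site.xml, yarn-site.xml) live under $HADOOP_HOME/etc/hadoop inside the jupyter container; adjust the path to wherever your config actually lives:

# point Spark/YARN clients at the directory that contains core-site.xml and yarn-site.xml
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# some tools read YARN_CONF_DIR instead, so setting both does no harm
export YARN_CONF_DIR=$HADOOP_CONF_DIR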
