
Spark Standalone

I have Ubuntu 14.04 with 4 CPUs on my machine (`nproc` returns 4). After installing and running Spark Standalone (locally), I can define the number of slaves myself. For example, I want to have 4 slaves (workers). After starting this number of slaves, I saw the following Spark Standalone screen:

(screenshot of the Spark Standalone master web UI)

How is it possible that I have a total number of cores of 16 (orange field) and 11 GB of memory, if a single worker already has 4 cores (I think 1 core is 1 CPU)? And what is the advantage of having 4 slaves instead of one? Probably, if I execute it locally, there is none (it will also be slower), but if I have a Hadoop cluster, how should the cores be shared, and how can I improve the speed of program execution? One additional question: if I start several applications (Scala, Python or Java), the first one is RUNNING while the other 2 or 3 are in WAITING mode. Is it possible to run all applications in parallel with each other?

You are misunderstanding several things here:

Standalone

This does not mean "local". Standalone mode is the application master built into Spark, which can be replaced by YARN or Mesos. You can use as many nodes as you want. You can indeed also run only locally, on a given number X of threads, by, for example, running the `./bin/spark-shell --master local[X]` command.
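For reference, the 4-worker layout from the question can be reproduced on a single machine with a `conf/spark-env.sh` along these lines (a sketch; the values are illustrative, chosen to match the question's 4-CPU host):

```shell
# conf/spark-env.sh -- illustrative values for a single 4-CPU machine.
SPARK_WORKER_INSTANCES=4   # number of worker processes started on this host
SPARK_WORKER_CORES=4       # cores each worker advertises to the master
SPARK_WORKER_MEMORY=2700m  # memory each worker advertises to the master
```

With this file in place, `./sbin/start-all.sh` brings up one master and four workers, which is exactly the situation shown in the screenshot.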

Cores/memory

Those numbers reflect the total amount of resources in your cluster, rounded up. Here, if we do the math, you have 4 * 4 CPUs = 16 CPUs, and 4 * 2.7 GB ~= 11 GB of memory.
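The arithmetic behind the UI's summary row can be checked directly (the per-worker figures are taken from the question's setup):

```python
# Reproduce the master UI's totals from the per-worker resources.
workers = 4
cores_per_worker = 4     # one worker advertises all 4 CPUs of the host
mem_per_worker_gb = 2.7  # each worker's advertised memory

total_cores = workers * cores_per_worker
total_mem_gb = workers * mem_per_worker_gb

print(total_cores)          # 16
print(round(total_mem_gb))  # 11
```

Note that the 16 cores are advertised, not physical: all four workers live on the same 4-CPU machine, so each CPU is counted once per worker.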

Resource management

If I have a Hadoop cluster, how should the cores be shared?

A Hadoop cluster is different from a Spark cluster. There are several ways to combine the two, but most of the time the part of Hadoop you'll be using in combination with Spark is HDFS, the distributed filesystem.

Depending on the application master you're using with Spark, the cores will be managed differently:

  • YARN uses node managers on the nodes to launch containers, in which you can launch Spark's executors (one executor = one JVM)

  • Spark Standalone uses workers as a gateway to launch the executors

  • Mesos directly launches executors
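From the application's point of view, only the `--master` URL passed to `spark-submit` changes between these cluster managers (a sketch; the host names are placeholders):

```shell
./bin/spark-submit --master spark://master-host:7077 app.jar  # Standalone
./bin/spark-submit --master yarn app.jar                      # YARN (reads HADOOP_CONF_DIR)
./bin/spark-submit --master mesos://mesos-host:5050 app.jar   # Mesos
./bin/spark-submit --master local[4] app.jar                  # local mode, 4 threads
```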

Scheduling

Hadoop and Spark use a technique known as delay scheduling, which basically relies on the principle that an application can decide to refuse an offer from a worker to place one of its tasks, in the hope that it can later receive a better offer in terms of data locality.
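The idea can be illustrated with a toy scheduler (a sketch only, not Spark's actual implementation; the function and node names are made up): the task prefers the node holding its data and turns down a bounded number of non-local offers before settling for any node.

```python
def schedule(preferred_node, offers, max_delay=2):
    """Place a task on the first data-local offer, refusing up to
    `max_delay` non-local offers before accepting any node."""
    refused = 0
    for node in offers:
        if node == preferred_node:
            return node        # data-local placement, the best case
        if refused < max_delay:
            refused += 1       # refuse, hoping a better offer arrives
        else:
            return node        # delay budget spent: give up on locality
    return None                # no placement found in this round

# The task waits out two non-local offers and lands on its preferred node:
print(schedule("node-a", ["node-b", "node-c", "node-a"]))           # node-a
# After exhausting the delay budget, it accepts a non-local node:
print(schedule("node-a", ["node-b", "node-c", "node-d", "node-e"])) # node-d
```

The trade-off is a small scheduling delay in exchange for reading input from local disk instead of over the network.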

How can I improve the speed of program execution?

This is a complex question that cannot be answered without knowledge of your infrastructure, input data, and application. Here are some of the parameters that will affect your performance:

  • Amount of memory available (mainly, to cache RDDs that are often used)
  • Use of compression for your data/RDDs
  • Application configuration

Is it possible to run all applications in parallel with each other?

By default, the Standalone master uses a FIFO scheduler for its apps, but you can set up the fair scheduler inside an application. For more details, see the scheduling documentation.
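Concretely, on Standalone the first application grabs all available cores by default, which is why the others sit in WAITING. Capping each application's share lets several run at once, and the fair scheduler additionally shares resources between jobs inside one application (a sketch; the values are illustrative):

```shell
# conf/spark-defaults.conf (or set via SparkConf per application)
spark.cores.max       4      # cap this app's cores so other apps can start
spark.scheduler.mode  FAIR   # fair sharing between jobs *within* this app
```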
