
How do we use clusters in open source Spark and Hortonworks' Hadoop sandbox?

I have a conceptual question. I downloaded Apache Spark and the Hortonworks Hadoop Sandbox. As far as I know, we analyze big data by distributing tasks across multiple machines or clusters. Amazon Web Services provides customers with clusters when they pay for its services. But in the case of Spark or Hadoop, whose clusters am I using when I simply download these environments? They say these environments provide a single-node cluster, which I assume is my computer itself. But then, how can I analyze big data if I am limited to my own computer? In brief, what is the logic of using Spark on my own laptop?

The environments are exactly what they say they are: a sandbox. They can be used to test functionality but not performance because, as you rightly said, they run entirely on your laptop. The VM comes configured with all the software necessary for you to test exactly that.
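As a minimal sketch of what that single-node setup looks like in practice (the app name and sample data below are made up for illustration), a Spark session in local mode uses only the cores of your own machine as its "cluster":

```python
# A minimal local-mode sketch (names and data here are illustrative).
# "local[*]" tells Spark to build its single-node "cluster" out of the
# cores on this machine -- fine for testing logic, not for performance.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sandbox-test")   # arbitrary app name
         .master("local[*]")        # run on all local cores, one machine
         .getOrCreate())

# A tiny job just to confirm the setup works end to end.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
print(df.count())  # prints 3

spark.stop()
```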

If you wish to get the true performance potential of Spark, then you will need to install Spark on a cluster of servers using the procedures they describe here. Only then will you truly be using the computational power of the servers you just installed Spark on.
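For contrast, here is roughly how the same session would be pointed at a real cluster instead of local mode; the master URL below is a placeholder, not an address from the question, and on a YARN-managed cluster you would typically go through spark-submit with --master yarn instead:

```python
# The same sketch aimed at a standalone Spark cluster. The hostname is
# a placeholder -- replace it with your own master's address.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-test")
         .master("spark://spark-master.example.com:7077")  # placeholder URL
         .getOrCreate())

# Work submitted through this session is now distributed across the
# worker machines in the cluster rather than run on your laptop.
```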

Hope that helps!
