
If data fits on a single machine does it make sense to use Spark?

I have 20GB of data that requires processing, and all of it fits on my local machine. I'm planning on using Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.

Since the data fits on a single machine, should I use Scala parallel collections?
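For concreteness, here is a minimal sketch of the parallel-collections route for the matrix work mentioned above: a matrix-vector multiply parallelized over rows with `.par`. The object name, sizes, and data are made up for illustration; note that on Scala 2.13+ the `.par` API lives in the separate `scala-parallel-collections` module, while on 2.12 it is built in.

```scala
// Hypothetical sketch: matrix-vector multiplication with Scala
// parallel collections, parallelizing over the rows of the matrix.
// On Scala 2.13+ this needs the scala-parallel-collections module.
object ParDemo {
  def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
    m.par                                              // one task per chunk of rows
     .map(row => row.zip(v).map { case (a, b) => a * b }.sum)
     .toArray                                          // map preserves row order

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val v = Array(1.0, 1.0)
    println(matVec(m, v).mkString(","))                // prints 3.0,7.0
  }
}
```

For 20GB of in-memory data this is about as simple as parallelism gets: no cluster, no serialization, just a thread pool over your cores.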

Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?
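For comparison, this is roughly what the single-machine Spark variant looks like: Spark running in `local[*]` mode, which uses all cores of one machine. This is a sketch assuming `spark-core`/`spark-sql` are on the classpath; the app name and numbers are made up. The scheduler and serialization overhead the question asks about is real, but it is mostly a fixed startup cost rather than a per-element one.

```scala
// Hypothetical sketch: the same kind of work in Spark local mode.
// Assumes Spark is on the classpath; local[*] = all cores, one JVM.
import org.apache.spark.sql.SparkSession

object LocalSparkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // single machine, all cores
      .appName("local-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // numSlices controls how many parallel tasks the job is split into
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
    println(sumOfSquares)

    spark.stop()
  }
}
```

The upside of writing it this way is that moving from `local[*]` to a real cluster is a one-line change to the master URL.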

It's hard to provide some non-obvious rule, like "if your data takes up no more than 80% of memory and ..., then use local mode". Having said that, there are a couple of points which, in general, may make you use Spark even if your data fits in one machine's memory:

  1. Really CPU-intensive processing; off the top of my head, complicated parsing of texts
  2. Stability -- say you have many processing stages and you don't want to lose results once your single machine goes down. It's especially important in case you have recurring calculations rather than one-off queries (this way, the time you spend on bringing Spark to the table might pay off)
  3. Streaming -- you get your data from somewhere in a streaming manner, and though a snapshot of it fits a single machine, you have to orchestrate it somehow

In your particular case

so since all of the data is as close as can be to the CPU Spark will not give any significant performance improvement

Of course not; Spark is not voodoo magic that might somehow get your data closer to the CPU, but it can help you scale across machines, and thus across CPUs (point #1)

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so this overhead is redundant in this case?

I may sound like Captain Obvious, but

  1. Take #2 and #3 into consideration: do you need them? If yes, go with Spark or something else
  2. If not, implement your processing in a dumb way (parallel collections)
  3. Profile and take a look. Is your processing CPU-bound? Can you speed it up without a lot of tweaks? If not, go with Spark.
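Point 3 above doesn't need anything fancy to start with; a crude wall-clock timer around the sequential and parallel versions already tells you whether the work is CPU-bound and whether local parallelism helps. A minimal sketch (names and data made up; `.par` again assumes Scala 2.12 or the `scala-parallel-collections` module):

```scala
// Crude timing harness for comparing sequential vs parallel runs
// of the same computation before reaching for Spark.
object Bench {
  def time[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body                                  // run the work once
    val ms = (System.nanoTime() - t0) / 1e6
    println(f"$label: $ms%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(2000000)(_.toDouble)
    val seq = time("sequential")(xs.map(x => math.sqrt(x)).sum)
    val par = time("parallel")(xs.par.map(x => math.sqrt(x)).sum)
    // floating-point reduction order differs, so compare with a tolerance
    assert(math.abs(seq - par) / seq < 1e-6)
  }
}
```

If the parallel version isn't meaningfully faster here, adding a cluster scheduler on top of the same one machine is unlikely to change that.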

There is also a [cheeky] point 4) in the list of "Why should I use Spark?". It's the hype -- Spark is a very sexy technology which is easy to "sell" to both your devs (it's the cutting edge of big data) and the company (your boss, in case you're building your own product; your customer, in case you're building a product for somebody else).

