If data fits on a single machine, does it make sense to use Spark?
I have 20 GB of data that requires processing, and all of it fits on my local machine. I'm planning on using Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.
Since the data fits on a single machine, should I use Scala parallel collections?
Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close as it can be to the CPU, Spark will not give any significant performance improvement?
Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so this overhead is redundant in this case?
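For the parallel-collections route, here is a minimal sketch of the kind of thing the question describes: a row-parallel matrix-vector multiply. The object name `ParMatVec`, the toy sizes, and the one-task-per-row split are illustrative, not a benchmark.

```scala
// Sketch only: matrix-vector multiply with Scala parallel collections.
// In Scala 2.13+ parallel collections live in the separate
// "scala-parallel-collections" module and need this import;
// in 2.12 `.par` is built into the standard library.
import scala.collection.parallel.CollectionConverters._

object ParMatVec {
  // Multiply an m x n matrix by a length-n vector; `.par` schedules
  // each row's dot product on the default fork-join pool.
  def matVec(m: Vector[Vector[Double]], v: Vector[Double]): Vector[Double] =
    m.par
      .map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      .seq

  def main(args: Array[String]): Unit = {
    val m = Vector(Vector(1.0, 2.0), Vector(3.0, 4.0))
    val v = Vector(1.0, 1.0)
    println(matVec(m, v).mkString(","))  // 3.0,7.0
  }
}
```

The appeal here is exactly what the question suggests: the work fans out over the local cores with no cluster setup, scheduler, or serialization cost at all.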
It's hard to provide non-obvious rules like "if your data doesn't go above 80% of memory and ..., then use local mode". Having said this, there are a couple of points which, in general, may make you use Spark even if your data fits in one machine's memory:
In your particular case:
so since all of the data is as close as can be to the CPU Spark will not give any significant performance improvement
Of course not. Spark is not voodoo magic that somehow gets your data closer to the CPU, but it can help you scale across machines, and thus across CPUs (point #1).
Spark will have the overhead setting up parallel tasks even though it will be just running on one machine, so this overhead is redundant in this case ?
I may sound like Captain Obvious, but there is also a [cheeky] point 4) in the list of "Why should I use Spark?":
It's the hype: Spark is a very sexy technology which is easy to "sell" both to your devs (it's the cutting edge of big data) and to the company (your boss, in case you're building your own product; your customer, in case you're building a product for somebody else).
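And if you do end up benchmarking Spark on the single machine, the "local mode" mentioned above is just a master setting. A minimal `spark-defaults.conf` sketch (the memory figure is illustrative, not a recommendation):

```
spark.master         local[*]
spark.driver.memory  24g
```

The same can be passed on the command line as `spark-submit --master "local[*]" --driver-memory 24g ...`; `local[*]` runs one worker thread per core inside a single JVM, so there is no cluster to provision.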