
If data fits on a single machine does it make sense to use Spark?

I have 20GB of data that requires processing, and all of it fits on my local machine. I'm planning on using Spark or Scala parallel collections to implement some algorithms and matrix multiplication against this data.

Since the data fits on a single machine, should I use Scala parallel collections?
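For concreteness, here is a minimal sketch of the parallel-collections route for the matrix work mentioned above: a matrix-vector multiply parallelized over rows with `.par`. The object name, sizes, and data are made up for illustration; note that on Scala 2.13+ the `.par` API lives in the separate `scala-parallel-collections` module, while on 2.12 it is built in.

```scala
// Hypothetical sketch: matrix-vector multiplication with Scala
// parallel collections, parallelizing over the rows of the matrix.
// On Scala 2.13+ this needs the scala-parallel-collections module.
object ParDemo {
  def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
    m.par                                              // one task per chunk of rows
     .map(row => row.zip(v).map { case (a, b) => a * b }.sum)
     .toArray                                          // map preserves row order

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val v = Array(1.0, 1.0)
    println(matVec(m, v).mkString(","))                // prints 3.0,7.0
  }
}
```

For 20GB of in-memory data this is about as simple as parallelism gets: no cluster, no serialization, just a thread pool over your cores.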

Is this true: the main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is already as close to the CPU as it can be, Spark will not give any significant performance improvement?

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so is this overhead redundant in this case?
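For comparison, this is roughly what the single-machine Spark variant looks like: Spark running in `local[*]` mode, which uses all cores of one machine. This is a sketch assuming `spark-core`/`spark-sql` are on the classpath; the app name and numbers are made up. The scheduler and serialization overhead the question asks about is real, but it is mostly a fixed startup cost rather than a per-element one.

```scala
// Hypothetical sketch: the same kind of work in Spark local mode.
// Assumes Spark is on the classpath; local[*] = all cores, one JVM.
import org.apache.spark.sql.SparkSession

object LocalSparkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // single machine, all cores
      .appName("local-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // numSlices controls how many parallel tasks the job is split into
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
    println(sumOfSquares)

    spark.stop()
  }
}
```

The upside of writing it this way is that moving from `local[*]` to a real cluster is a one-line change to the master URL.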

It's hard to provide some non-obvious rule, like "if your data takes up no more than 80% of memory and ..., then use local mode". Having said that, there are a couple of points which, in general, may make you use Spark even if your data fits in one machine's memory:

  1. Really CPU-intensive processing; off the top of my head, complicated parsing of texts
  2. Stability -- say you have many processing stages and you don't want to lose results once your single machine goes down. It's especially important in case you have recurring calculations rather than one-off queries (this way, the time you spend on bringing Spark to the table might pay off)
  3. Streaming -- you get your data from somewhere in a streaming manner, and though a snapshot of it fits a single machine, you have to orchestrate it somehow

In your particular case

so since all of the data is as close as can be to the CPU Spark will not give any significant performance improvement

Of course not; Spark is not voodoo magic that might somehow get your data closer to the CPU, but it can help you scale across machines, and thus across CPUs (point #1)

Spark will have the overhead of setting up parallel tasks even though it will just be running on one machine, so this overhead is redundant in this case?

I may sound like Captain Obvious, but

  1. Take #2 and #3 into consideration: do you need them? If yes, go with Spark or something else
  2. If not, implement your processing in a dumb way (parallel collections)
  3. Profile and take a look. Is your processing CPU-bound? Can you speed it up without a lot of tweaks? If not, go with Spark.
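Point 3 above doesn't need anything fancy to start with; a crude wall-clock timer around the sequential and parallel versions already tells you whether the work is CPU-bound and whether local parallelism helps. A minimal sketch (names and data made up; `.par` again assumes Scala 2.12 or the `scala-parallel-collections` module):

```scala
// Crude timing harness for comparing sequential vs parallel runs
// of the same computation before reaching for Spark.
object Bench {
  def time[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body                                  // run the work once
    val ms = (System.nanoTime() - t0) / 1e6
    println(f"$label: $ms%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(2000000)(_.toDouble)
    val seq = time("sequential")(xs.map(x => math.sqrt(x)).sum)
    val par = time("parallel")(xs.par.map(x => math.sqrt(x)).sum)
    // floating-point reduction order differs, so compare with a tolerance
    assert(math.abs(seq - par) / seq < 1e-6)
  }
}
```

If the parallel version isn't meaningfully faster here, adding a cluster scheduler on top of the same one machine is unlikely to change that.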

There is also a [cheeky] point 4) in the list of "Why should I use Spark?". It's the hype -- Spark is a very sexy technology which is easy to "sell" to both your devs (it's the cutting edge of big data) and the company (your boss, in case you're building your own product; your customer, in case you're building a product for somebody else).

