
For given operations on a large set of data, is there a way to determine if the data can be decomposed into mapreduce operations?

We do stats and such on large sets of data. Right now it is all done on one machine. We're studying the feasibility of moving to a map-reduce paradigm, where we decompose the data into subsets, run some operations on those, then combine the results.

Is there any sort of mathematical test that can be applied to a set of operations to determine if the data they operate on can be decomposed?

Or maybe a list somewhere saying what can and cannot be decomposed?

For instance, I didn't think there was a way to decompose standard deviation, but there is...

edit: added tags
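As an illustration of the standard-deviation decomposition mentioned above (a minimal sketch, not from the original question; the function names are hypothetical): each subset can be reduced to a triple of (count, mean, sum of squared deviations), and two such triples can be merged exactly, so the full-set standard deviation falls out of the combined result.

```python
import math

def summarize(xs):
    """Reduce one data subset to (count, mean, M2), where M2 is the
    sum of squared deviations from the subset's own mean."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2

def merge(a, b):
    """Exactly combine two partial summaries (pairwise update in the
    style of Chan et al.); this works as the 'reduce' step."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# Example: split the data, summarize each part, merge, then finish.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
left, right = summarize(data[:4]), summarize(data[4:])
n, mean, m2 = merge(left, right)
print(mean, math.sqrt(m2 / n))  # population std dev of the full set: 2.0
```

Because `merge` is associative, the partial summaries can be combined in any grouping, which is exactly what a reduce phase needs.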

Variance, as well as the mean, can be calculated online (in a single pass); see Wikipedia. There's also a parallel algorithm.
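A small sketch of the single-pass update this answer refers to, hedged as one common formulation (Welford's online algorithm); the function name is just for illustration:

```python
def online_mean_variance(stream):
    """Welford's online algorithm: one pass over the data, O(1) state."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else float("nan")  # population variance
    return n, mean, variance

print(online_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
# (8, 5.0, 4.0)
```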

Parallel computing is best suited to problems that are "embarrassingly parallel", i.e., there is no dependency between any two tasks. Please check out http://en.wikipedia.org/wiki/Embarrassingly_parallel
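For intuition (a hedged sketch, not part of the answer; `chunk_sum` is a made-up helper): when the per-chunk work has no cross-chunk dependency, each chunk can be processed independently and only the final combine step looks at other chunks' results.

```python
from multiprocessing import Pool

def chunk_sum(chunk):
    """Independent per-chunk work: no communication with other chunks."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool() as pool:
        partials = pool.map(chunk_sum, chunks)  # embarrassingly parallel map
    print(sum(partials))                        # trivial combine step
```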

Also, in cases where the operations are commutative or associative, MapReduce programs can easily be optimized for better performance.
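A rough illustration of why that helps (a hypothetical mini-pipeline, not tied to any specific MapReduce framework): when the reduce function is associative and commutative, it can also be run as a "combiner" on each mapper's local output, shrinking the data shuffled across the network without changing the final result.

```python
from collections import defaultdict

def local_combine(pairs, reduce_fn):
    """Apply the reduce function per key on one mapper's output.
    This is only correct because reduce_fn (here, addition) is
    associative and commutative, so partial results can be re-reduced."""
    acc = defaultdict(list)
    for key, value in pairs:
        acc[key].append(value)
    return [(k, reduce_fn(vs)) for k, vs in acc.items()]

# Two mappers emit (word, 1) pairs.
mapper_outputs = [
    [("cat", 1), ("dog", 1), ("cat", 1)],
    [("dog", 1), ("dog", 1)],
]

# The combiner shrinks each mapper's output before the shuffle...
combined = [local_combine(out, sum) for out in mapper_outputs]
# ...and the final reduce sees fewer pairs but produces the same totals.
final = local_combine([p for out in combined for p in out], sum)
print(sorted(final))  # [('cat', 2), ('dog', 3)]
```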

Take a look at this paper: http://www.janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf. They have algorithms for many common statistical problems, and there is open source code available.
