
MapReduce with Recursion

Consider the following problem:

EDIT: Ignore the algorithm below if it doesn't make much sense; I only included it for illustration. The idea is that doFunc is somehow recursive.

doFunc(A):
    [a0, a1, a2, ...] <- A
    If (someCondition([a0, a1, a2, ...]) == False)
        A <- modified(A)
        r = doFunc(modified(A))
        A <- convertR(r)
    B <- someFunc1(A)
    C <- someFunc2(B)
    r <- lastFunc(C)
    return r

In this case, r is the result of the recursive function doFunc. As long as someCondition on the list a0, a1, a2, ... is false, the function recurses to obtain some kind of optimal A for which the condition is true.

Now consider that MapReduce could be applied individually to different parts of the program. Say, for example, that converting A to a0, a1, a2, ..., then computing modifiedA, and then applying someFuncI are all possible using MapReduce. How does the recursion fit into this MapReduce implementation?

Given this, Hadoop Streaming seems to be out of the question, since I don't understand how to implement it with recursion. The only other possibility I see is to use a Python wrapper around Hadoop Streaming, for example dumbo or mrjob, and write the code as if there were no recursion, even though the recursion will obviously unfold when doFunc is called recursively. I am wondering how that fits in with MapReduce and what the scalability is like.

Questions: I have asked the questions in the text above, but they might not be clear enough, so I'll lay them out clearly here.

  1. Does MapReduce behave well with recursion?
  2. If so, does it scale well?
  3. Is there a way to implement Hadoop Streaming with functions involving recursion?

The only form of recursion that can be implemented in Hadoop is tail recursion, meaning that the recursive call must come at the end of the current call. Strictly speaking, recursion can't be emulated at all in Hadoop, because the framework can't save the state of the current job while the next one (the recursive call) executes and then reload the current job and resume its execution. However, tail recursion can be simulated by chaining jobs, i.e. when one job ends, the next one(s) start.
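
A minimal sketch of that chaining idea, written in plain Python rather than against any real Hadoop API. Here `run_job` is only a stand-in for submitting one MapReduce job and waiting for it to finish, and `some_condition`, `halve_mapper` and `identity_reducer` are hypothetical placeholders loosely named after the question's pseudocode. Assuming the recursive call is in tail position, the recursion turns into a driver loop in which each job's output becomes the next job's input:

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def some_condition(records):
    # Hypothetical stopping test: every value is below 10.
    return all(value < 10 for value in records)


def halve_mapper(value):
    # Hypothetical "modified" step: halve each value and key it by the result.
    yield value // 2, value // 2


def identity_reducer(key, values):
    # Pass every grouped value through unchanged.
    yield from values


def do_func(records, max_jobs=100):
    """Driver loop: chain jobs until the condition holds."""
    jobs_run = 0
    while not some_condition(records):
        if jobs_run >= max_jobs:
            raise RuntimeError("condition not reached within max_jobs")
        records = run_job(records, halve_mapper, identity_reducer)
        jobs_run += 1
    return records, jobs_run


result, jobs = do_func([100, 37, 8])
print(result, jobs)
```

The `max_jobs` guard is an added assumption: a chained pipeline has no call stack to overflow, so some explicit bound is a reasonable way to avoid submitting jobs forever if the condition is never met.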

I have successfully chained tens to hundreds of jobs, so there is no particular problem with running a few (probably even thousands of) jobs in sequence. However, this practice carries a performance penalty for three main reasons: setting up and tearing down jobs takes time, jobs might fail and need to be restarted, and a job might run on slower machines that delay its completion.

But, apart from these details, what I think you should do is make sure Hadoop is what you need. Hadoop is a pretty specialized framework in the sense that it addresses tasks which are "data parallelizable", i.e. tasks which work on (usually) big data and which can be applied either to the entire data set at once or repeatedly to small chunks of that data and, in the end, achieve the same result as when applied to the whole data set. What you describe doesn't seem to fall into this category.
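
As a small illustration of what "data parallelizable" means here, the sketch below counts words once over a whole data set and once over independent chunks whose partial results are merged afterwards. The sample lines, the chunk size and the helper names `count_words` and `chunked` are made up for the example; it runs in a single process and only mimics the shape of the computation:

```python
from collections import Counter


def count_words(lines):
    """Count word occurrences in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


lines = ["a b a", "b c", "a c c", "d"]

# Apply the function to the whole data set at once.
whole = count_words(lines)

# Apply it to small chunks independently, then merge the partial results.
partials = [count_words(chunk) for chunk in chunked(lines, 2)]
merged = sum(partials, Counter())

print(whole == merged)
```

A computation in which each step needs the complete result of the previous step, as in the recursive doFunc, does not split up this way.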

I think you have not explained your problem well, or maybe you have misunderstood MapReduce.

If by recursion you mean that you want to put a recursive function inside the Map or Reduce functions, the answer is yes: you can use a recursive function in both phases. But if you mean to define a recursive MapReduce job and you want to do that in Hadoop, that is not possible, or at least it is not safe and straightforward in Hadoop to define recursive jobs.
The answer to the second and third questions is the same: possible in the first case, and impossible if you mean a recursive job.
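
For the first case, a rough sketch of a Hadoop Streaming style mapper in Python whose per-record logic is recursive. Streaming mappers read lines from standard input and write tab-separated key/value lines to standard output; everything else here is an assumption for illustration, including the input format (one integer per line) and the recursive function `collatz_steps`, which merely stands in for whatever recursive computation the mapper needs:

```python
import io


def collatz_steps(n):
    """Recursively count Collatz steps needed for n to reach 1."""
    if n == 1:
        return 0
    next_n = n // 2 if n % 2 == 0 else 3 * n + 1
    return 1 + collatz_steps(next_n)


def run_mapper(stdin, stdout):
    """Read one integer per line; write 'key<TAB>value' lines."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        n = int(line)
        stdout.write(f"{n}\t{collatz_steps(n)}\n")


# Local check with in-memory streams instead of a real Hadoop job.
out = io.StringIO()
run_mapper(io.StringIO("6\n1\n\n7\n"), out)
print(out.getvalue(), end="")
```

To use this as an actual streaming mapper, the script would end with a call to `run_mapper(sys.stdin, sys.stdout)` instead of the local check. The recursion stays entirely inside one map call, so the framework never has to suspend or resume a job.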

