
Loading same file for each mapper

Let's say we have 10 data points and 5 mappers, and the goal is to compute the distance between every pair of points. Naively this takes O(N^2) work, since each point is compared with every other point.

What I want to do is load the whole file containing the data points into each mapper and have each mapper operate on a different subset of points. For example, mapper #1 would calculate the distance of point 1 and point 2 to all the other points, mapper #2 would calculate the distance of point 3 and point 4 to all the other points, and so on.
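That partitioning scheme can be sketched in plain Python, outside Hadoop (the point data and the two-points-per-mapper split are illustrative assumptions):

```python
import math

def mapper_work(points, mapper_id, points_per_mapper=2):
    """Distances that one mapper would compute.

    Mapper k is responsible for the points starting at index
    k * points_per_mapper, and compares each of its points against
    every other point in the (fully loaded) file.
    """
    start = mapper_id * points_per_mapper
    own = range(start, min(start + points_per_mapper, len(points)))
    results = {}
    for i in own:
        for j in range(len(points)):
            if i != j:
                results[(i, j)] = math.dist(points[i], points[j])
    return results

# 10 points, 5 mappers: mapper 0 handles points 0 and 1 and emits
# 2 x 9 = 18 distances; the 5 mappers together cover all 90 ordered pairs.
points = [(float(i), 0.0) for i in range(10)]
print(len(mapper_work(points, mapper_id=0)))  # 18
```

Inside a real Hadoop job the outer loop over `own` disappears: each mapper derives its own index range from its task id and only runs the inner comparison loop.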

I came across this algorithm in a paper, but it gave no concrete way to implement it. Any ideas or suggestions on how to load the whole file into each mapper, or how to make each mapper operate on a specific index range within the file, would be much appreciated.

Take a look at this paper, which suggests using the "block nested loop" join (Section 3). It is slightly different from what you ask, but can easily be extended to match your needs: if you treat both R and S as the same source, it ends up comparing all points to all other points, as you require.
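A minimal in-memory sketch of the block-nested-loop idea (plain Python; the block size and data are illustrative assumptions, and in the MapReduce version each block pair would be shipped to a different task rather than iterated locally):

```python
import math
from itertools import product

def blocks(seq, size):
    """Split a sequence into consecutive blocks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def block_nested_loop(R, S, block_size):
    """Compare every point of R with every point of S, one block pair at a time."""
    out = {}
    for rb, sb in product(blocks(R, block_size), blocks(S, block_size)):
        for r in rb:
            for s in sb:
                if r != s:
                    out[(r, s)] = math.dist(r, s)
    return out

# Treating R and S as the same source yields all pairwise distances.
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = block_nested_loop(pts, pts, block_size=2)
print(d[((0.0, 0.0), (3.0, 4.0))])  # 5.0
```

The point of the blocking is that each block pair is an independent unit of work, which is exactly what lets the join be distributed across mappers or reducers.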

For your requirements, you don't need to implement the second MapReduce job from the paper, which keeps only the top-k results.

In Hadoop 1.2.0 (old API), you can get the total number of mappers with conf.get("mapred.map.tasks") and the index of the current mapper with conf.get("mapred.task.partition").
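Those two values are enough for each mapper to pick its own slice of points without any coordination. A minimal simulation of that index arithmetic (plain Python; the round-robin split by index is one possible assignment, standing in for the values you would read from the JobConf):

```python
def indices_for_mapper(task_partition, num_map_tasks, num_points):
    """Point indices assigned to one mapper: a round-robin split,
    mirroring what you'd compute from mapred.task.partition and
    mapred.map.tasks inside configure()."""
    return [i for i in range(num_points)
            if i % num_map_tasks == task_partition]

# 10 points across 5 mappers: each mapper gets exactly 2 points,
# and together the mappers cover every point exactly once.
slices = [indices_for_mapper(p, 5, 10) for p in range(5)]
print(slices[0])  # [0, 5]
```

Any deterministic split works, as long as every mapper applies the same rule to the same parameters; the round-robin form also balances the load when num_points is not a multiple of the mapper count.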

However, to answer your question on how to make the same file available to all mappers: you can use the Distributed Cache.

