
Loading same file for each mapper

Let's say we have 10 data points and 5 mappers, and the goal is to compute the distances between the points. Normally this takes O(N^2), since every pair of points must be compared.

What I want to do is load the whole file containing the data points into each mapper and have each mapper operate on different points. For example, mapper #1 would calculate the distances of points 1 and 2 against all the other points, mapper #2 the distances of points 3 and 4 against all the other points, and so on.

I came across this algorithm in a paper, but it gave no specific way to implement it. Any ideas or suggestions on how to load the whole file into each mapper, or how to make each mapper operate on a specific range of indices in the file, would be much appreciated.

Take a look at this paper, which suggests using the "block nested loop" join (Section 3). It is slightly different from what you ask, but can easily be extended to match your needs: if you treat both R and S as one source, then it ends up comparing all points to all other points, as you require.

For your requirements, you don't need to implement the second MapReduce job that keeps only the top-k results.

In Hadoop 1.2.0 (old API), you can get the total number of mappers by reading the mapred.map.tasks property with conf.get("mapred.map.tasks"), and the index of the current mapper with conf.get("mapred.task.partition").
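A minimal sketch of reading both properties inside a mapper's configure() method (old org.apache.hadoop.mapred API; the class name IndexAwareMapper is illustrative):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class IndexAwareMapper extends MapReduceBase {

    private int totalMappers; // total number of map tasks in the job
    private int mapperIndex;  // 0-based index of this map task

    @Override
    public void configure(JobConf conf) {
        totalMappers = conf.getInt("mapred.map.tasks", 1);
        mapperIndex = conf.getInt("mapred.task.partition", 0);
    }
}
```

With these two values, each mapper can pick its own slice of points, e.g. every point i with i % totalMappers == mapperIndex.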

However, to answer your question on how to get the same file for all mappers, you can use the Distributed Cache.
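As a rough sketch of the whole approach, assuming the points file lives at /user/hadoop/points.txt (a hypothetical path) and stores one point per line as comma-separated coordinates: the driver registers the file with DistributedCache.addCacheFile(new URI("/user/hadoop/points.txt"), conf), and each mapper loads the cached copy in configure() and compares the points from its own input split against the full list:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PairwiseDistanceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final List<double[]> allPoints = new ArrayList<double[]>();

    @Override
    public void configure(JobConf conf) {
        try {
            // Local copies of the files the driver registered with
            // DistributedCache.addCacheFile(...).
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader =
                    new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                allPoints.add(parsePoint(line));
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Cannot read cached points file", e);
        }
    }

    @Override
    public void map(LongWritable offset, Text value,
                    OutputCollector<Text, DoubleWritable> out,
                    Reporter reporter) throws IOException {
        // 'value' is one point from this mapper's input split; compare it
        // against every point loaded from the cached copy of the full file.
        double[] p = parsePoint(value.toString());
        for (double[] q : allPoints) {
            out.collect(new Text(value.toString()),
                        new DoubleWritable(euclidean(p, q)));
        }
    }

    private static double[] parsePoint(String line) {
        String[] parts = line.split(",");
        double[] point = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            point[i] = Double.parseDouble(parts[i].trim());
        }
        return point;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Because each mapper's input split covers a disjoint subset of the points, every point is still compared against the complete cached list exactly once; if you want the exact index-based assignment from your question instead, combine this with the mapred.task.partition index shown above.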
