
Matrix computation using Hadoop MapReduce

I have a matrix with around 10,000 rows. I wrote code that takes one row per iteration, does some long matrix computations, and returns one double per row. Since the number of operations per row is very large, the code takes a long time to run. I'm thinking of implementing it with MapReduce, but I'm not sure whether that's possible. The main idea is to split the matrix rows across different nodes, run the jobs independently, and combine the outputs into a list of numbers. From my understanding, a mapper alone could do this job. Am I right? Is it possible, or is there a better approach? Thanks in advance. By the way, the code is in Java.
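For illustration, a minimal sketch of that map-only idea, assuming the matrix is stored as a text file with one whitespace-separated row per line; the sum-of-squares is just a stand-in for the real per-row computation:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: each mapper computes one double per matrix row and
// emits it directly; the driver would call job.setNumReduceTasks(0).
public class RowComputeMapper
        extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().trim().split("\\s+");
        double sum = 0.0;
        for (String field : fields) {
            double v = Double.parseDouble(field);
            sum += v * v; // placeholder for the long matrix computation
        }
        context.write(offset, new DoubleWritable(Math.sqrt(sum)));
    }
}
```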

This seems possible - some points for consideration:

You might want to run an identity mapper (one that passes each input record through to the reducer unchanged) and do the row calculation in the reducer. Doing the calculation map-side will probably still cause all the calculations to be done on a single node: it's quite possible that your 10,000-row matrix is smaller than a single input split, in which case the whole file goes to one mapper.
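A sketch of that arrangement, under the same text-file assumptions as above. The stock org.apache.hadoop.mapreduce.Mapper already behaves as an identity mapper, so only the reducer needs to be written:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (byte offset, row text) pairs passed through by the identity
// mapper and runs the heavy per-row computation reduce-side.
public class RowComputeReducer
        extends Reducer<LongWritable, Text, LongWritable, DoubleWritable> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> rows, Context context)
            throws IOException, InterruptedException {
        for (Text row : rows) { // normally one row per key; iterate to be safe
            String[] fields = row.toString().trim().split("\\s+");
            double sum = 0.0;
            for (String field : fields) {
                double v = Double.parseDouble(field);
                sum += v * v; // placeholder for the real computation
            }
            context.write(key, new DoubleWritable(Math.sqrt(sum)));
        }
    }
}
```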

You'll want to run a large number of reducers to ensure the job is parallelized across your cluster's nodes. The default partitioner will handle sending the input rows to different reducers, assuming your rows are not fixed width. With fixed-width rows, every byte-offset key is a multiple of the row width, which can hash unevenly across the reducers; in that case, run a custom mapper that uses a running counter as the output key instead of the default byte offset of the input row.
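A sketch of such a mapper. One caveat: the counter restarts at zero in each input split, so keys can repeat across splits; that is harmless here as long as the reducer iterates over every value for a key, as the sketch above does.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Replaces the byte-offset key with a running counter so that
// fixed-width rows still spread evenly across the reducers.
public class CounterKeyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private long counter = 0;
    private final LongWritable outKey = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        outKey.set(counter++); // counter restarts per split, so keys may repeat
        context.write(outKey, row);
    }
}
```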

To bring all the results back together, you'll need to run a second MR job with a single reducer.
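Putting the pieces together, a driver along these lines chains the two jobs; the class names come from the sketches above, and the paths and reducer count are placeholders to tune. The second job can rely entirely on the default identity mapper and reducer, since it only needs to funnel the per-row results through one reducer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: compute one double per row, spread over many reducers.
        Job compute = Job.getInstance(conf, "row-compute");
        compute.setJarByClass(MatrixDriver.class);
        compute.setMapperClass(CounterKeyMapper.class);
        compute.setReducerClass(RowComputeReducer.class);
        compute.setMapOutputKeyClass(LongWritable.class);
        compute.setMapOutputValueClass(Text.class);
        compute.setOutputKeyClass(LongWritable.class);
        compute.setOutputValueClass(DoubleWritable.class);
        compute.setNumReduceTasks(50); // tune to the cluster size
        FileInputFormat.addInputPath(compute, new Path(args[0]));
        FileOutputFormat.setOutputPath(compute, new Path(args[1]));
        if (!compute.waitForCompletion(true)) System.exit(1);

        // Job 2: identity map/reduce with a single reducer gathers all
        // per-row results into one output file.
        Job collect = Job.getInstance(conf, "collect-results");
        collect.setJarByClass(MatrixDriver.class);
        collect.setNumReduceTasks(1);
        FileInputFormat.addInputPath(collect, new Path(args[1]));
        FileOutputFormat.setOutputPath(collect, new Path(args[2]));
        System.exit(collect.waitForCompletion(true) ? 0 : 1);
    }
}
```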
