Running MapReduce over geographically distributed VMs - How bad is this setup for a Hadoop cluster?

As the subject says: is it important to run a Hadoop cluster on dedicated hardware rather than VMs? If so, what network latency is acceptable? Is Gigabit Ethernet required? I would like to use Hadoop to speed up an ETL process. To try this, I set up a few VMs (512 MB to 1 GB RAM, 1 core per VM of a dual-core 2.2 GHz CPU) which are about 500 miles apart, with a network latency of 10-25 ms on 100 Mbps Ethernet. With 3-4 VMs as nodes, I am unable to match the performance of a single machine for my ETL process. So I thought I would ask here for more insight.

It depends greatly on your tasks, but generally it all matters: network latency, bandwidth, and CPU load and availability.

I can picture a few scenarios where network bandwidth would not matter much. For example, suppose you've already loaded your data into HDFS, i.e. it's evenly distributed across all the nodes, and you're going to do a complex computation on it in the mappers, with no reducers at all or with only a small fraction of the data going to reducers. Say you're counting the number of lines in text files: mappers would read multi-gigabyte files and push only one simple number each to the reducers, the number of lines. Reducers would sum up these numbers and write a single answer to the output. Virtually nothing is transferred across the network, so the network has no effect on performance.
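To illustrate the line-count case, here is a minimal sketch in plain Python (not actual Hadoop code). The function names and the sample blocks are made up for illustration; the point is only that each mapper reads a large local block but emits a single integer.

```python
def map_count_lines(block: str) -> int:
    """Mapper: scan one locally stored block and emit a single number."""
    return len(block.splitlines())


def reduce_sum(partial_counts: list[int]) -> int:
    """Reducer: add up the per-mapper counts into one answer."""
    return sum(partial_counts)


# Hypothetical data: three tiny "blocks" standing in for multi-gigabyte files,
# each assumed to live on a different node.
blocks = ["a\nb\nc\n", "d\ne\n", "f\n"]

# Only these small integers would have to cross the network to the reducer.
partial_counts = [map_count_lines(block) for block in blocks]
total_lines = reduce_sum(partial_counts)

print(partial_counts, total_lines)
```

However large the blocks are, the data sent to the reducer stays at one number per mapper, which is why latency and bandwidth barely matter for a job shaped like this.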

However, in real life you'd encounter such tasks rather rarely. Usually there is some group-by going on between mappers and reducers, so most of the per-group calculation is performed by the reducers. That means the reducers have to pull all the data from the mappers, usually using the network heavily.
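As a rough back-of-the-envelope sketch of why the group-by case hurts on a slow link, the snippet below estimates how long it takes just to move the shuffle data. The 1 GB shuffle size and the 80% link efficiency are assumptions chosen for illustration, not measurements, and the estimate ignores latency, retries, and disk I/O.

```python
def shuffle_seconds(shuffle_bytes: float, link_mbps: float,
                    efficiency: float = 0.8) -> float:
    """Estimate the time to push shuffle data over one link.

    efficiency is an assumed fraction of the nominal bandwidth
    that is actually usable for payload.
    """
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return shuffle_bytes * 8 / usable_bits_per_second


one_gb = 1_000_000_000  # hypothetical amount of mapper output to shuffle

t_100mbps = shuffle_seconds(one_gb, 100)    # roughly 100 seconds
t_1gbps = shuffle_seconds(one_gb, 1000)     # roughly 10 seconds

print(f"100 Mbps: {t_100mbps:.0f} s, 1 Gbps: {t_1gbps:.0f} s")
```

If a single machine can process the same gigabyte from local disk faster than that, a small cluster on such a link will struggle to beat it.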

If you tell me more about your tasks, I can give a more detailed estimate of what hardware you'd want and what the weak points of the current setup are.

Dedicated hardware is always important.
Your VMs definitely don't have enough RAM. Network latency will matter, but 100 Mbps is probably enough with 3-4 nodes.
