
Running a standalone Hadoop application on multiple CPU cores

My team built a Java application that uses the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multi-core server will do fine for the coming year or so. We do not (yet) need a multi-server Hadoop cluster, but we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans) I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Given that the tool is very CPU-intensive, this "single-threadedness" is my current bottleneck.

When running it under the NetBeans profiler I do see that the app starts several threads for various purposes, but only a single map/reduce task is running at any given moment.

The input data consists of several input files, so Hadoop should at least be able to run one map thread per input file concurrently during the map phase.

What do I do to get at least 2 or even 4 active threads running (which should be possible for most of this application's processing time)?

I'm expecting this to be something very silly that I've overlooked.


I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. It implements the feature I was looking for in Hadoop 0.21, introducing the flag mapreduce.local.map.tasks.maximum to control it.

For now I've also found the solution described in this question.
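For reference, a minimal driver sketch showing where that flag goes, assuming Hadoop 0.21+ with the new mapreduce API (the class name, job name and the value 4 are placeholders, not anything from the original question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalParallelDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // MAPREDUCE-1367 (Hadoop 0.21+): let the LocalJobRunner execute
            // up to 4 map tasks concurrently instead of one at a time.
            conf.setInt("mapreduce.local.map.tasks.maximum", 4);

            Job job = new Job(conf, "local-parallel-transform");
            job.setJarByClass(LocalParallelDriver.class);
            // Mapper, reducer and key/value classes go here as usual.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that this flag only affects the LocalJobRunner, i.e. jobs run in local mode from the command line or an IDE.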

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.

Anyway, to set the maximum number of concurrently running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default both options are set to 2, so I might be right.
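Concretely, those two options go into the tasktracker's mapred-site.xml. A sketch (the value 4 is just an example for a quad-core box; the tasktracker reads this at daemon start-up, so it has to be restarted and cannot be set per job):

    <!-- conf/mapred-site.xml on the machine running the tasktracker -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value> <!-- map slots; default is 2 -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value> <!-- reduce slots; default is 2 -->
      </property>
    </configuration>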

Finally, if you want to be prepared for a multi-node cluster, go straight to running Hadoop in fully distributed mode, but have all the daemons (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.

Just for clarification: if Hadoop runs in local mode you don't have parallel execution at the task level (unless you're running Hadoop >= 0.21, see MAPREDUCE-1367). You can, however, submit multiple jobs at once, and those are then executed in parallel.

All the

mapred.tasktracker.{map|reduce}.tasks.maximum

properties only apply to Hadoop running in distributed mode!

HTH, Johannes

What you want to do is run Hadoop in "pseudo-distributed" mode: one machine, but running task trackers and name nodes as if it were a real cluster. Then it will (potentially) run several workers.
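A minimal pseudo-distributed configuration looks roughly like the sketch below; the localhost ports are the conventional values from the Hadoop 0.20-era quickstart docs, so adjust to taste. After formatting HDFS with bin/hadoop namenode -format and starting the daemons with bin/start-all.sh, submitted jobs get real task-level parallelism:

    <!-- conf/core-site.xml -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- conf/hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- single datanode, so no replication -->
      </property>
    </configuration>

    <!-- conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>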

Note that if your input is small, Hadoop will decide it's not worth parallelizing. You may have to coax it by changing its default split size, as sketched below.
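One way to do that coaxing, assuming the new-API FileInputFormat (org.apache.hadoop.mapreduce.lib.input), is to cap the split size so that even small files produce several splits and therefore several map tasks. The helper class and the 16 MB cap here are arbitrary illustrations:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuning {
        // Cap each input split at 16 MB so that even modest input files
        // are broken into several splits, i.e. several parallel map tasks.
        public static void shrinkSplits(Job job) {
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        }
    }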

In my experience, "typical" Hadoop jobs are I/O bound, sometimes memory-bound, way before they are CPU-bound. You may find it impossible to fully utilize all the cores on one machine for this reason.

According to this thread on the hadoop.core-user email list, you'll want to change the mapred.tasktracker.tasks.maximum setting to the maximum number of tasks you would like your machine to handle (which would be the number of cores).

This (and other properties you may want to configure) is also documented in the main documentation on how to set up your cluster/daemons.
