
How to debug hadoop mapreduce jobs from eclipse?

I'm running hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in eclipse. Eclipse has no problem running mapreduce tasks. However, when I go to debug, it gives me this error:

12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Okay, so I did some research. Apparently, I should use eclipse's remote debugging facility and add this to my hadoop-env.sh:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000

I do that and I can step through my code in eclipse. The only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I imagine because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.

So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions:

  1. Is there an easier way to debug mapreduce jobs in eclipse?

  2. How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?

  3. Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks (such as "hadoop queue" or "hbase shell")?

  4. Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.

  5. This is a more general question: what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?

Here is my Main class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
      public static void main(String[] args) throws Exception {     
        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");

        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));

        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
}

Make changes in the /bin/hadoop (hadoop-env.sh) script. Check to see which command has been fired; only when the command is jar, add the remote debug configuration.

if [ "$COMMAND" = "jar" ] ; then
  exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
else
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
fi
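A variant of the same idea, sketched here as a helper function: guard the flags behind an environment variable (HADOOP_DEBUG is a made-up name, not a stock Hadoop setting), so the script stays in normal mode unless you explicitly opt in and you never have to edit it twice.

```shell
# Hypothetical helper for the /bin/hadoop script: emit the debug flags
# only when the subcommand is "jar" AND HADOOP_DEBUG=1 is set, so
# "hadoop queue", "hadoop fs", "hbase shell", etc. keep running normally.
build_debug_opts() {
  command="$1"
  if [ "$command" = "jar" ] && [ "${HADOOP_DEBUG:-0}" = "1" ]; then
    echo "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999"
  fi
}
```

The exec line then becomes `exec "$JAVA" $(build_debug_opts "$COMMAND") $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"`, and you enable debugging per invocation with `HADOOP_DEBUG=1 hadoop jar ...`.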

The only way you can debug hadoop in eclipse is by running hadoop in local mode. The reason is that each map reduce task runs in its own JVM, and when you don't run hadoop in local mode, eclipse won't be able to debug.

When you set hadoop to local mode, instead of using the hdfs API (which is the default), the hadoop file system changes to file:///. Thus, running hadoop fs -ls will not be an hdfs command, but rather hadoop fs -ls file:///, a path to your local directory. Neither the JobTracker nor the NameNode runs.
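For reference, this is what forcing local mode looks like in code. A minimal configuration fragment, assuming the Hadoop 1.x property names in use at the time of this question (fs.default.name and mapred.job.tracker):

```java
// Hadoop 1.x configuration fragment: force local mode explicitly
// rather than relying on whatever core-site.xml/mapred-site.xml say.
Configuration conf = new Configuration();
conf.set("fs.default.name", "file:///");   // local filesystem instead of hdfs://
conf.set("mapred.job.tracker", "local");   // run the job in-process, no JobTracker
Job job = new Job(conf);
```

With both properties set to these values, the whole job runs inside the JVM that eclipse launched, which is why ordinary breakpoints work.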

These blog posts might help:

Jumbune's debugger will do all of this with minimal effort.

The debugger provides code-level control flow statistics of the MapReduce job.

Users may apply regex validations or their own user-defined validation classes. As per the validations applied, the Flow Debugger checks the flow of data for the mapper and reducer respectively.

It also provides a comprehensive table/chart view where the flow of input records is displayed at the job level, MR level, and instance level. Unmatched keys/values represent the number of erroneous key/value records in the job execution result. The debugger drills down into the code to examine the flow of data through various counters, like loops and conditions (if, else-if, etc.).

Jumbune is open source and available at www.jumbune.org and https://github.com/impetus-opensource/jumbune.

Besides the recommended MRUnit, I like to debug with eclipse as well. I have a main program. It instantiates a Configuration and executes the MapReduce job directly. I just debug with standard eclipse Debug configurations. Since I include the hadoop jars in my mvn spec, I have all of hadoop in my classpath and I have no need to run it against my installed hadoop. I always test with small data sets in local directories to make things easy. The defaults for the configuration behave as a standalone hadoop (the file system is available).

I also like to debug via unit tests with MRUnit. I use this in combination with approvaltests, which creates an easy visualization of the Map Reduce process and makes it easy to pass in scenarios that are failing. It also runs seamlessly from eclipse.
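For context, a plain MRUnit test (without approvaltests) might look like the sketch below. It uses MRUnit's MapDriver from the mapreduce package; WordCountMapper is a hypothetical Mapper&lt;LongWritable, Text, Text, IntWritable&gt;, not a class from this question.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOneCountPerWord() throws Exception {
    // MapDriver feeds one input record to the mapper and checks
    // the emitted (key, value) pairs, all in-process.
    MapDriver.newMapDriver(new WordCountMapper())
        .withInput(new LongWritable(1), new Text("cat cat dog"))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("dog"), new IntWritable(1))
        .runTest();  // breakpoints inside the mapper are hit normally
  }
}
```

Because everything runs in the test JVM, standard eclipse debugging works with no remote-debug setup at all.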

For example:

HadoopApprovals.verifyMapReduce(new WordCountMapper(), 
                         new WordCountReducer(), 0, "cat cat dog");

This will produce the output:

[cat cat dog] 
-> maps via WordCountMapper to ->
(cat, 1) 
(cat, 1) 
(dog, 1)

-> reduces via WordCountReducer to ->
(cat, 2) 
(dog, 1)

There's a video on the process here: http://t.co/leExFVrf

Adding args to hadoop's internal java command can be done via the HADOOP_OPTS env variable:

export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
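Rather than exporting this for the whole shell session, you can also set it for a single invocation, so other "hadoop" and "hbase shell" commands in other terminals are unaffected. A sketch (myjob.jar and com.example.Main are placeholder names):

```shell
# Per-invocation debug: HADOOP_OPTS applies only to this one command,
# so nothing persistent in hadoop-env.sh needs to change.
HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y" \
  hadoop jar myjob.jar com.example.Main
```

With suspend=y the JVM waits for eclipse to attach on port 5005 before running the job; every other hadoop command, launched without this variable, runs normally.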

