
Hadoop cluster hangs on Reduce > copy >

So far for this issue I have tried the solutions from here, 1, and here, 2. However, while these solutions do result in the MapReduce task being carried out, it appears they only run on the name node, as I get output similar to here, 3.

Basically, I am running a two-node cluster with a MapReduce algorithm that I designed myself. The MapReduce jar executes perfectly on a single-node cluster, which leads me to think that something is wrong with my Hadoop multi-node configuration. To set up the multi-node cluster, I followed the tutorial here.

To report what is going wrong: when I execute my program (after checking that the NameNode, TaskTrackers, JobTracker, and DataNodes are running on their respective nodes), it halts with this line in the terminal:

INFO mapred.JobClient: map 100% reduce 0%

If I take a look at the logs for the task, I see copy failed: attempt... from slave-node followed by a SocketTimeoutException.

Taking a look at the logs on my slave node (DataNode) shows that execution halts at the following line:

TaskTracker: attempt... 0.0% reduce > copy >

As the solutions in links 1 and 2 suggest, removing various IP addresses from the etc/hosts file results in successful execution; however, I then end up with entries such as those in link 4 in my slave-node (DataNode) log, for example:

INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381

WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.

This looks suspect to me as a new Hadoop user, but it may be perfectly normal. To me it looks as though something was pointing to an incorrect IP address in the hosts file, and that by removing that IP address I simply halt execution on the slave node, so processing continues on the namenode instead (which isn't really advantageous at all).

To sum up:

  1. Is this output expected?
  2. Is there a way I can see what was executed on which node, post-execution?
  3. Can anybody spot anything that I may have done wrong?

EDIT: added the hosts and config files for each node.

Master: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Dell-System-XPS-L702X

#The following lines are for hadoop master/slave setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Slave: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line was incorrect, it was set as 7.0.1.1

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Master: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Slave: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>

</configuration>

Master: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Slave: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Master: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

Slave: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>

</configuration>

The error is in etc/hosts:

During the erroneous runs, the slave etc/hosts file looked like this:

127.0.0.1       localhost
7.0.1.1       joseph-Home # THIS LINE IS INCORRECT, IT SHOULD BE 127.0.1.1

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

As you may have spotted, the IP address for this computer, 'joseph-Home', was incorrectly configured: it was set to 7.0.1.1 when it should have been 127.0.1.1. Changing line 2 of the slave etc/hosts file to 127.0.1.1 joseph-Home fixed the issue, and my logs now appear normally on the slave node.

New etc/hosts file:

127.0.0.1       localhost
127.0.1.1       joseph-Home # corrected: this was previously 7.0.1.1

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
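
A quick way to confirm a hosts file like this is sane is to check what the cluster names and the local hostname actually resolve to on each node. The sketch below is my own addition, not part of the original post: the class name HostCheck is illustrative, and the names it probes ("master", "slave") simply mirror the entries in the etc/hosts files shown above. Run it on both nodes and verify that the local hostname maps to a loopback or LAN address rather than something unreachable like 7.0.1.1.

// HostCheck.java - minimal sketch to verify hostname resolution on a node.
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostCheck {
    public static void main(String[] args) {
        // Names taken from the cluster's etc/hosts files above.
        String[] names = { "localhost", "master", "slave" };
        for (String name : names) {
            try {
                InetAddress addr = InetAddress.getByName(name);
                System.out.println(name + " -> " + addr.getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println(name + " does not resolve: " + e.getMessage());
            }
        }
        try {
            // The node's own hostname (e.g. joseph-Home) should resolve to
            // 127.0.1.1 here, not to an unreachable address such as 7.0.1.1.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("local hostname " + local.getHostName()
                    + " -> " + local.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println("local hostname does not resolve: " + e.getMessage());
        }
    }
}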

A tested solution is to add the property below to hadoop-env.sh and restart all Hadoop cluster services.

hadoop-env.sh

export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"

I also ran into this problem today. In my case, the disk of one node in the cluster was full, so Hadoop could not write its log files to the local disk. A possible solution is therefore to delete some unused files on that node's local disk. Hope it helps.
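
If you suspect the same cause, a minimal sketch like the one below can report free space on the directories Hadoop writes to. This is my own illustration, not from the original answer: /home/hduser/tmp matches the hadoop.tmp.dir in the configs above, while the log directory path is only an assumed default and will depend on your installation; the class name DiskCheck is likewise illustrative.

// DiskCheck.java - minimal sketch to spot a full disk on a cluster node.
import java.io.File;

public class DiskCheck {
    public static void main(String[] args) {
        // /home/hduser/tmp is the hadoop.tmp.dir from the configs above;
        // /var/log/hadoop is an assumed log location and may differ on your setup.
        String[] dirs = { "/home/hduser/tmp", "/var/log/hadoop" };
        for (String d : dirs) {
            File f = new File(d);
            if (!f.exists()) {
                System.out.println(d + " does not exist on this node");
                continue;
            }
            long freeMb = f.getUsableSpace() / (1024 * 1024);
            long totalMb = f.getTotalSpace() / (1024 * 1024);
            System.out.printf("%s: %d MB free of %d MB%n", d, freeMb, totalMb);
        }
    }
}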
