
Hadoop cluster stuck / hangs on Reduce > copy >

So far I have tried the solutions here 1 and here 2. However, although these solutions do result in the mapreduce task executing, it appears they only run on the name node, as I get output similar to here 3.

Basically, I am running a 2-node cluster with a mapreduce algorithm of my own design. The mapreduce jar executes perfectly on a single-node cluster, which makes me think something is wrong with my hadoop multi-node configuration. To set up the multi-node cluster I followed the tutorial here.

To describe what goes wrong: when I execute my program (after checking that the NameNode, TaskTracker, JobTracker and DataNode are running on their respective nodes), it stalls in the terminal on this line:

INFO mapred.JobClient: map 100% reduce 0%

If I look at the task logs, I see copy failed: attempt... from slave-node, followed by a SocketTimeoutException.

Looking at the logs on my slave node (DataNode) shows that execution stops at the following line:

TaskTracker: attempt... 0.0% reduce > copy >
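
For context, the reduce > copy phase in Hadoop 1.x pulls map output over HTTP from each TaskTracker (default mapred.task.tracker.http.address port 50060), so a SocketTimeoutException at this point usually means one node cannot reach the other. A minimal connectivity check between the two nodes, assuming that default port, looks like this:

# On the master: confirm the slave's TaskTracker HTTP server is reachable,
# since the reduce > copy phase fetches map output from this port.
ping -c 1 slave
nc -zv slave 50060

# And in the other direction, on the slave:
ping -c 1 master
nc -zv master 50060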

As suggested by the solutions in links 1 and 2, removing the various IP addresses from the etc/hosts file does lead to successful execution; however, I then end up with the items from link 4 in my slave-node (DataNode) logs, such as:

INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381

WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.

As a new hadoop user this looks suspicious to me, although it may be perfectly normal. To me it seems to point at an incorrect IP address in the hosts file, and by removing this IP address I am simply halting execution on the slave node and continuing processing on the namenode alone (which is not advantageous at all).

To summarize:

  1. Is this the expected output?
  2. Is there a way to see what was executed on which node after the run? (See the sketch after this list.)
  3. Can anyone spot anything I might be doing wrong?
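
Regarding question 2: assuming a standard Hadoop 1.x install with the default web UI ports, one way to see what ran where is to check the daemons on each node with jps and then browse the JobTracker web UI, which lists every task attempt together with the TaskTracker that executed it. A sketch:

# On each node, list the running Hadoop daemons (Java processes):
jps
# master should show NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker
# slave should show DataNode, TaskTracker

# JobTracker web UI (default port 50030): click through to the job, then to its
# map/reduce tasks, to see which TaskTracker ran each attempt.
#   http://master:50030/
# The per-attempt logs are kept on the node that ran the attempt, in the
# userlogs directory inside the Hadoop log directory.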

EDIT: Added the hosts and config files for each node

Master: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Dell-System-XPS-L702X

#The following lines are for hadoop master/slave setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Slave: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line was incorrect, it was set as 7.0.1.1

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Master: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Slave: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>

</configuration>

Master: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Slave: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Master: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

Slave: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>

</configuration>
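
Since both nodes point fs.default.name at hdfs://master:54310 and mapred.job.tracker at master:54311, it is also worth confirming from the slave that those two ports are reachable by name, and from the master that the slave's DataNode has actually registered. A quick check, assuming a Hadoop 1.x installation and the ports configured above:

# Run on the slave: both master ports must be reachable
nc -zv master 54310        # NameNode RPC (fs.default.name)
nc -zv master 54311        # JobTracker RPC (mapred.job.tracker)

# Run on the master: should report two live datanodes if both have registered
hadoop dfsadmin -report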

The error was in etc/hosts:

During the faulty run, the slave etc/hosts file looked like this:

127.0.0.1       localhost
7.0.1.1       joseph-Home # THIS LINE IS INCORRECT, IT SHOULD BE 127.0.1.1

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

As you may have spotted, the IP address of this machine, 'joseph-Home', was configured incorrectly: it was set to 7.0.1.1 when it should have been 127.0.1.1. Changing line 2 of the slave etc/hosts file to 127.0.1.1 joseph-Home therefore fixed the problem, and my logs now appear normally on the slave node.
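
A quick way to catch this kind of mistake (a sketch, assuming standard Linux tools, with joseph-Home being the slave's hostname from the files above) is to check on each node that its own hostname and the cluster names resolve to the addresses you expect:

# Run on the slave
hostname                      # should print joseph-Home
getent hosts joseph-Home      # should print 127.0.1.1, not 7.0.1.1
getent hosts master slave     # should print 192.168.1.87 and 192.168.1.74
ping -c 1 master              # confirms the master is reachable by name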

New etc/hosts file:

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line is now correct (it was previously 7.0.1.1)

#the following lines are for hadoop mutli-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

A tested solution is to add the following property to hadoop-env.sh and restart all of the hadoop cluster services:

hadoop-env.sh

export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"

I also ran into this problem today. In my case the problem was that the disk of one node in the cluster was full, so hadoop could not write its log files to the local disk; a possible way to solve it is therefore to delete some unused files from that local disk. Hope this helps.
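
A quick way to confirm this situation (a sketch, assuming a Linux node and the paths used in the configs above, i.e. /home/hduser/tmp for Hadoop's temporary data):

# Check free space on every mounted filesystem
df -h

# See what is using the space under the Hadoop temp and log directories
du -sh /home/hduser/tmp
du -sh $HADOOP_HOME/logs      # adjust if your logs live elsewhere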

