
Data placement and distribution in HDFS for heterogeneous Hadoop cluster

I have installed Apache Hadoop 2.x on 5 heterogeneous nodes, one of which is dedicated purely to the NameNode.

I am using the following command to put my input file into HDFS:

$ hdfs dfs -put /home/hduser/myspace/data /user/hduser/inputfile

HDFS replicates this input file to three DataNodes (DNs), which means the 4th DataNode does not hold any input blocks. If I use 8 mappers (by setting the split size with NLineInputFormat), will these 8 mappers be assigned across all 4 DNs? I think they should be. In that case, data blocks from the other DNs will have to be moved to the 4th DN to be computed by the mappers assigned to it, which increases the overall execution time.
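For context, the number of input lines per split for NLineInputFormat can be set with the standard Hadoop 2.x property `mapreduce.input.lineinputformat.linespermap`; one split is then created per group of lines, and one mapper per split. A sketch of passing it on the command line (the jar and driver class names are placeholders, and this form assumes the driver implements `Tool` so generic `-D` options are parsed):

```shell
# Give each mapper 100 input lines; with 800 input lines this yields 8 splits,
# hence 8 mappers (jar/class/paths below are illustrative placeholders)
hadoop jar myjob.jar MyDriver \
    -D mapreduce.input.lineinputformat.linespermap=100 \
    /user/hduser/inputfile /user/hduser/output
```

The same property can equally be set in the driver code via `job.getConfiguration().setInt(...)` before submission.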

My questions are:

  1. Can we somehow manage to place data blocks on each DN so that no data needs to be moved to the mappers running on a particular DN? Can this be accomplished with the hdfs "put" command?

  2. Also, in the case of a heterogeneous cluster, can we put different amounts of data on different DNs depending on each node's computing power?

We cannot manage the placement of data blocks on each DN. You mentioned that HDFS replicates the file to 3 DNs; this is true only if your file size is less than the block size. HDFS splits a file into multiple blocks and replicates each block independently. So there is a good probability that the file's blocks end up spread across all 4 DNs.

Block placement is decided entirely by Hadoop, which manages it internally. You can only configure the replication factor via

dfs.replication

or the block size via

dfs.blocksize

(the older names dfs.replication.factor and dfs.block.size are not valid in Hadoop 2.x) to influence how the data is distributed.
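For example, both settings can be applied per file at upload time with `-D` options, or the replication factor can be changed afterwards with `setrep` (the values below are illustrative):

```shell
# Upload with replication factor 4 and a 64 MB block size (example values);
# with replication 4 every DN in a 4-DN cluster holds a copy of each block
hdfs dfs -D dfs.replication=4 -D dfs.blocksize=64m \
    -put /home/hduser/myspace/data /user/hduser/inputfile

# Or raise the replication factor of an already-uploaded file;
# -w waits until the target replication is actually reached
hdfs dfs -setrep -w 4 /user/hduser/inputfile
```

Note that raising replication to match the node count trades disk space and write bandwidth for data locality; it does not give per-node control over which blocks land where.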

If you want to check the block placement, you can open the HDFS web UI at

Namenode:50070

and browse to the file; it will show you the placement of its blocks across all the nodes.
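Alternatively, the same information is available from the command line with fsck:

```shell
# List every block of the file and the DataNodes holding each replica
hdfs fsck /user/hduser/inputfile -files -blocks -locations
```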
