HDFS file block distribution in a two-node cluster

Environment

Hadoop : 0.20.205.0
Number of machines in cluster : 2 nodes
Replication : set to 1
DFS Block size : 1MB

I put a 7.4 MB file into HDFS using the put command and ran the fsck command to check how the file's blocks are distributed among the datanodes. All 8 blocks of the file end up on a single node. This skews the load distribution, and only that node gets used when running MapReduce tasks.

Is there a way to distribute the blocks of a file across more than one datanode?

bin/hadoop dfsadmin -report
Configured Capacity: 4621738717184 (4.2 TB)
Present Capacity: 2008281120783 (1.83 TB)
DFS Remaining: 2008281063424 (1.83 TB)
DFS Used: 57359 (56.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (6 total, 4 dead)

Name: 143.215.131.246:50010
Decommission Status : Normal
Configured Capacity: 2953506713600 (2.69 TB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 1022723801073 (952.49 GB)
DFS Remaining: 1930782883840(1.76 TB)
DFS Used%: 0%
DFS Remaining%: 65.37%
Last contact: Fri Jul 18 10:31:51 EDT 2014

bin/hadoop fs -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3

bin/hadoop fs -ls /user/rkannan3
Found 1 items
-rw-------   1 rkannan3 supergroup    7420270 2014-07-18 10:40 /user/rkannan3/pg20417.txt

bin/hadoop fsck /user/rkannan3 -files -blocks -locations
FSCK started by rkannan3 from /143.215.131.246 for path /user/rkannan3 at Fri Jul 18 10:43:13 EDT 2014
/user/rkannan3 <dir>
/user/rkannan3/pg20417.txt 7420270 bytes, 8 block(s):  OK <==== All the 8 blocks in one DN
0. blk_3659272467883498791_1006 len=1048576 repl=1 [143.215.131.246:50010]
1. blk_-5158259524162513462_1006 len=1048576 repl=1 [143.215.131.246:50010]
2. blk_8006160220823587653_1006 len=1048576 repl=1 [143.215.131.246:50010]
3. blk_4541732328753786064_1006 len=1048576 repl=1 [143.215.131.246:50010]
4. blk_-3236307221351862057_1006 len=1048576 repl=1 [143.215.131.246:50010]
5. blk_-6853392225410344145_1006 len=1048576 repl=1 [143.215.131.246:50010]
6. blk_-2293710893046611429_1006 len=1048576 repl=1 [143.215.131.246:50010]
7. blk_-1502992715991891710_1006 len=80238 repl=1 [143.215.131.246:50010]

If you want distribution at the file level, use a replication factor of at least 2. The first replica is always placed on the node where the writer is located (see the introduction of http://waset.org/publications/16836/optimizing-hadoop-block-placement-policy-and-cluster-blocks-distribution ), and a file normally has only one writer, so the first replica of every block of a file ends up on that node. You probably don't want to change that behaviour, because it leaves you the option of increasing the minimum split size when you want to avoid spawning too many mappers, without losing data locality for those mappers.
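As a minimal sketch (reusing the file path from the question and assuming the Hadoop 0.20 shell), you can either write the file with a higher replication factor or raise the replication of the file already in HDFS:

bin/hadoop fs -D dfs.replication=2 -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3

bin/hadoop fs -setrep -w 2 /user/rkannan3/pg20417.txt

The first command overrides dfs.replication for that single put; the second raises the replication of the existing file and, with -w, waits until the extra replicas are written. With two datanodes, the second replica of each block has to land on the other node.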

You can use the Hadoop balancer command. Details below.

Balancer

Runs the cluster balancing utility. You can press Ctrl-C at any time to stop the rebalancing process.

   Usage: hadoop balancer [-threshold <threshold>]

   -threshold <threshold>   Percentage of disk capacity. This overrides the default threshold.
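As a minimal sketch of an actual run (the 5% threshold is only an illustration), the following asks the balancer to bring every datanode's disk usage to within 5% of the cluster-wide average, moving blocks from over-used nodes to under-used ones:

   bin/hadoop balancer -threshold 5

Note that with a replication factor of 1 the balancer moves the only copy of each block, so tasks scheduled on the original node lose data locality for the blocks that were moved.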
