
Even data distribution on hadoop/hive

I am trying a small Hadoop setup (for experimentation) with just 2 machines. I am loading about 13 GB of data, a table of around 39 million rows, with a replication factor of 1, using Hive.

My problem is that Hadoop always stores all this data on a single datanode. Only if I change the dfs.replication factor to 2 using setrep does Hadoop copy data to the other node. I also tried the balancer ( $HADOOP_HOME/bin/start-balancer.sh -threshold 0 ). The balancer recognizes that it needs to move around 5 GB to balance, but then says No block can be moved. Exiting... and exits:

2010-07-05 08:27:54,974 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 0.0
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.252.130.177:1036
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes:  10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.

Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
No block can be moved. Exiting...
Balancing took 2.222 seconds

Can anybody suggest how to achieve even distribution of data on Hadoop, without replication?

Are you using both of your machines as datanodes? It's highly unlikely, but you can confirm this for me.

Typically in a 2-machine cluster, I'd expect one machine to be the namenode and the other one to be the datanode. That would explain why, when you set the replication factor to 1, the data gets copied to the only datanode available to it. If you change it to 2, Hadoop may look for another datanode in the cluster to copy the data to, won't find one, and hence may exit.
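One way to confirm how many datanodes are actually live, and where the blocks of your table ended up, is with the standard HDFS admin tools. This is a sketch assuming a classic Hadoop 0.20-era layout with `$HADOOP_HOME` set; the table path under the Hive warehouse is an example, not your actual path:

```shell
# Report cluster status; the "Datanodes available" line tells you how many
# nodes the namenode sees, and the per-node section shows used/remaining space.
$HADOOP_HOME/bin/hadoop dfsadmin -report

# Show which datanodes hold the blocks of a given file or directory
# (replace the path with your table's warehouse directory).
$HADOOP_HOME/bin/hadoop fsck /user/hive/warehouse/mytable -files -blocks -locations
```

If `dfsadmin -report` shows only one datanode, the second machine's datanode daemon is not registered with the namenode, which would fully explain both the single-node storage and the balancer having no valid target to move blocks to.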

