

How to explicitly define datanodes to store a particular given file in HDFS?

I want to write a script or something like an .xml file which explicitly defines which datanodes in a Hadoop cluster should store the blocks of a particular file. For example: suppose there are 4 slave nodes and 1 master node (5 nodes total in the Hadoop cluster), and two files, file01 (size = 120 MB) and file02 (size = 160 MB). The default block size is 64 MB.
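For reference, the block counts in this example follow from a simple ceiling division of file size by block size. A minimal sketch (the 64 MB block size and file sizes are just the numbers from the example above):

    public class BlockCount {
        // Number of HDFS blocks a file occupies: ceiling of fileSize / blockSize.
        static long blocks(long fileSizeMb, long blockSizeMb) {
            return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        }

        public static void main(String[] args) {
            long blockSizeMb = 64; // default block size from the question
            System.out.println("file01: " + blocks(120, blockSizeMb) + " blocks"); // 2 blocks
            System.out.println("file02: " + blocks(160, blockSizeMb) + " blocks"); // 3 blocks
        }
    }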

Now I want to store one of the two blocks of file01 on slave node1 and the other on slave node2. Similarly, for file02, one of its three blocks should go to slave node1, the second to slave node3 and the third to slave node4. So my question is: how can I do this?

Actually there is one method: change the conf/slaves file every time before storing a file. But I don't want to do this, so is there another way? I hope I made my point clear. Waiting for your kind response..!!!

There is no method to achieve what you are asking here - the NameNode will replicate blocks to datanodes based upon rack configuration, replication factor and node availability, so even if you did manage to get a block onto two particular datanodes, if one of those nodes goes down, the NameNode will replicate the block to another node.

Your requirement also assumes a replication factor of 1, which doesn't give you any data redundancy (which is a bad thing if you lose a datanode).

Let the namenode manage block assignments, and use the balancer periodically if you want to keep your cluster evenly distributed.

The NameNode is the ultimate authority for deciding on block placement. There is a Jira about making this algorithm pluggable: https://issues.apache.org/jira/browse/HDFS-385
but unfortunately it is in the 0.21 version, which is not a production release (although it works reasonably well).
I would suggest plugging your algorithm into 0.21 if you are at the research stage and then waiting for 0.23 to become production, or backporting the code to 0.20 if you do need it now.
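To illustrate what a pluggable placement policy (the HDFS-385 idea) would let you express, here is a minimal, self-contained sketch. It does not use the real Hadoop API (the BlockPlacementPolicy class and its method signatures differ between versions); it only models the core decision of mapping a file path to a preferred set of datanodes and falling back to a default choice otherwise. The paths, node names, and the chooseTarget method shown here are hypothetical.

    import java.util.*;

    // Conceptual sketch of the decision a pluggable placement policy (HDFS-385) would make.
    // Not the real Hadoop BlockPlacementPolicy API; all names here are hypothetical.
    public class PreferredPlacementSketch {
        // Hypothetical mapping: file path -> datanodes its blocks should prefer.
        private final Map<String, List<String>> preferred = new HashMap<>();
        private final List<String> allNodes;

        PreferredPlacementSketch(List<String> allNodes) {
            this.allNodes = allNodes;
            preferred.put("/user/data/file01", Arrays.asList("slave1", "slave2"));
            preferred.put("/user/data/file02", Arrays.asList("slave1", "slave3", "slave4"));
        }

        // Choose a target node for block number blockIndex of the given file.
        String chooseTarget(String path, int blockIndex) {
            List<String> nodes = preferred.get(path);
            if (nodes != null) {
                return nodes.get(blockIndex % nodes.size());   // round-robin over preferred nodes
            }
            return allNodes.get(blockIndex % allNodes.size()); // default: spread over the cluster
        }

        public static void main(String[] args) {
            PreferredPlacementSketch policy =
                new PreferredPlacementSketch(Arrays.asList("slave1", "slave2", "slave3", "slave4"));
            System.out.println(policy.chooseTarget("/user/data/file01", 0)); // slave1
            System.out.println(policy.chooseTarget("/user/data/file01", 1)); // slave2
            System.out.println(policy.chooseTarget("/user/data/file02", 2)); // slave4
        }
    }

Even with a policy like this in place, the NameNode will still re-replicate blocks elsewhere when a node goes down, as noted above.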
