
Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (Namenode) is responsible for storing the blocks of data on the slave machines (Datanodes).

When we use -copyToLocal or -get from the master, files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file systems?

For example, a 128 MB file could be split across 2 slave nodes, each storing a 64 MB block. Is there any way for a slave to identify and load the chunk it holds into its own local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.

Short Answer: No

The data/files cannot be copied directly from the Datanodes. The reason is that Datanodes store the data but do not have any metadata about the stored files; to them, the blocks are just bits and bytes. The metadata of the files is stored in the Namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the Namenode keeps track of which blocks of a file are stored on which Datanodes. The Datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
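If you want to see this block-to-Datanode mapping yourself, the public FileSystem API exposes it. Below is a minimal sketch that asks the Namenode for the block layout of a file; the class name and file path are placeholders, and it assumes your cluster's core-site.xml / hdfs-site.xml are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Placeholder path; use a file that actually exists on your cluster.
        Path file = new Path("/user/hadoop/input/bigfile.txt");

        // Assumes the Hadoop client configuration is on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(file);
        // The Namenode answers this query: it knows which Datanodes hold each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}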

Can the commands -copyToLocal or -get be used in this case also?

Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
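For example, from a slave node (both paths are placeholders):

hdfs dfs -get /user/hadoop/input/bigfile.txt /tmp/bigfile.txt

To do the same thing programmatically, as asked above, the same operation is available through the FileSystem API. A minimal sketch, again assuming the client configuration is on the classpath and the class name and paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of: hdfs dfs -get <hdfs path> <local path>
        fs.copyToLocalFile(new Path("/user/hadoop/input/bigfile.txt"),
                           new Path("/tmp/bigfile.txt"));
        fs.close();
    }
}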

What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your command-line client doesn't know their location.

HDFS blocks are stored on the slaves' local FS only. You can dig down into the directory defined under the property "dfs.datanode.data.dir", but you won't get any benefit from reading the blocks directly (without the HDFS API). Also, reading and editing the block files directly can corrupt the file on HDFS.
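That said, if you only want to know which block files under that directory belong to a given HDFS file, the fsck tool will list the block IDs and the Datanodes holding them (the path below is a placeholder):

hdfs fsck /user/hadoop/input/bigfile.txt -files -blocks -locations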

If you want to store data on a particular slave's local disk yourself, then you will have to implement your own logic for maintaining the block metadata (which the Namenode already does for you).

Can you elaborate on why you want to distribute the blocks yourself, when Hadoop already takes care of all the challenges of distributed data?

You can copy a particular file or directory from one slave to another slave by using distcp.

Usage: distcp slave1address slave2address
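In practice, distcp is invoked through the hadoop launcher with full HDFS URIs for the source and destination; the namenode host, port and paths below are placeholders:

hadoop distcp hdfs://namenode:8020/source/path hdfs://namenode:8020/destination/path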
