
AWS EMR: Does the master node store HDFS data in an EMR cluster?

Original post: 2022-03-28 16:04:01 | Tags: amazon-web-services, amazon-emr

Master node: does this node store HDFS data in an AWS EMR cluster?
Task node: if this node does not store HDFS data, is it a purely computational node? In this case does Hadoop transfer to task node? Does this not defeat the data locality advantage?

1 Answer

(Other than the edge case of a master-only cluster with no core or task instances...)

The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.

The only nodes that store HDFS data are those that run the HDFS DataNode, and only the core instances run it.

The core and task instances both run the YARN NodeManager and thus are the "computational nodes".
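The role split described above can be summarized as a small lookup table. This is only an illustrative Python sketch: `DAEMONS_BY_ROLE` and the two helper functions are hypothetical names, not part of any AWS or Hadoop API. The table lists only the daemons named in this answer and ignores the master-only edge case.

```python
# Illustrative only: daemons per EMR instance role, as described in this answer.
# These names are hypothetical helpers, not part of any AWS or Hadoop API.
DAEMONS_BY_ROLE = {
    "MASTER": {"ResourceManager", "NameNode"},
    "CORE": {"DataNode", "NodeManager"},
    "TASK": {"NodeManager"},
}


def stores_hdfs_data(role: str) -> bool:
    """A node stores HDFS blocks only if it runs the HDFS DataNode."""
    return "DataNode" in DAEMONS_BY_ROLE[role]


def runs_yarn_containers(role: str) -> bool:
    """A node does computation only if it runs the YARN NodeManager."""
    return "NodeManager" in DAEMONS_BY_ROLE[role]


if __name__ == "__main__":
    for role in DAEMONS_BY_ROLE:
        print(role, stores_hdfs_data(role), runs_yarn_containers(role))
```

Read this way, a task instance is simply a core instance without the DataNode.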

Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether or not Hadoop transfers (HDFS) data to the task instances so that they may perform computations on HDFS data. In a sense, yes, task instances may read HDFS blocks remotely from core instances where the blocks are stored.
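To make the remote-read point concrete, here is a toy Python sketch that classifies a read as local or remote. The host names and the `read_kind` function are made up for illustration; real Hadoop scheduling also considers rack locality, which this sketch leaves out.

```python
# Toy model: a read is "local" only when the task runs on a host that
# holds a replica of the block. Host names here are hypothetical.
def read_kind(task_host: str, replica_hosts: set) -> str:
    """Return 'local' if task_host holds a replica of the block, else 'remote'."""
    return "local" if task_host in replica_hosts else "remote"


# Replicas can only live on core hosts, since only those run a DataNode.
block_replicas = {"core-1"}

if __name__ == "__main__":
    for host in ("core-1", "core-2", "task-1"):
        print(host, read_kind(host, block_replicas))
```

Note that a core instance without a replica of the block (`core-2` here) reads remotely as well, so remote reads are not unique to task instances.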

It's true that this means that task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as for tasks that read shuffle data from other nodes, or tasks that read data from remote storage anyway (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be stored in remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, the difference might not be noticeable enough for you to need to worry about data locality.