
AWS EMR: Does the master node store HDFS data in an EMR cluster?

Master node: does this node store HDFS data in an AWS EMR cluster? Task node: if this node does not store HDFS data, is it a purely computational node? In this case, does Hadoop transfer to the task node? Does this not defeat the data-locality advantage?

(Other than the edge case of a master-only cluster with no core or task instances...)

The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.

The only nodes that store HDFS data are those that run the HDFS DataNode, and only the core instances do.

The core and task instances both run the YARN NodeManager and thus are the "computational nodes".
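To make the division of roles concrete, here is a minimal Python sketch of the mapping described above. The dictionary and the helper names (DAEMONS_BY_ROLE, stores_hdfs_data, runs_yarn_containers) are made up for illustration and are not part of any EMR or Hadoop API. A real cluster runs more services than the ones listed, and the master-only edge case is ignored.

```python
# Daemons per EMR instance role, as described in the answer.
# Illustrative only: real clusters run additional services, and the
# master-only edge case (no core or task instances) is not modeled.
DAEMONS_BY_ROLE = {
    "master": {"YARN ResourceManager", "HDFS NameNode"},
    "core": {"HDFS DataNode", "YARN NodeManager"},
    "task": {"YARN NodeManager"},
}


def stores_hdfs_data(role):
    """A role stores HDFS blocks only if it runs a DataNode."""
    return "HDFS DataNode" in DAEMONS_BY_ROLE[role]


def runs_yarn_containers(role):
    """A role does computation only if it runs a NodeManager."""
    return "YARN NodeManager" in DAEMONS_BY_ROLE[role]


for role in DAEMONS_BY_ROLE:
    print(
        f"{role}: stores HDFS data={stores_hdfs_data(role)}, "
        f"computes={runs_yarn_containers(role)}"
    )
```

Read this way, the master neither stores blocks nor computes, core instances do both, and task instances only compute.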

Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether Hadoop transfers (HDFS) data to the task instances so that they can perform computations on it. In a sense, yes: task instances may read HDFS blocks remotely from the core instances where the blocks are stored.
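As a toy illustration of why such reads are always remote, the sketch below classifies a read by whether the host running the task also holds a replica of the block. The function name and host names are hypothetical, and the model is deliberately simplified: it ignores rack-level locality and is not how Hadoop itself implements scheduling.

```python
def read_locality(task_host, replica_hosts):
    """Return 'node-local' if the task's host holds a replica, else 'remote'."""
    return "node-local" if task_host in replica_hosts else "remote"


# Hypothetical hosts: block replicas live only on core instances,
# because only core instances run an HDFS DataNode.
replicas = {"core-1", "core-2"}

print(read_locality("core-1", replicas))  # node-local
print(read_locality("task-1", replicas))  # remote
```

Since a task instance never appears among the replica hosts, every HDFS read it performs falls into the remote case.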

It's true that this means task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as tasks that read shuffle data from other nodes, or tasks that read data from remote storage in any case (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be stored in remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, the cost might not be noticeable enough that you need to worry about data locality.

