
How many connections will be established between Spark and HDFS when sc.textFile("hdfs://…") is called

How many connections will be established between Spark and HDFS when sc.textFile("hdfs://.....") is called? The file on HDFS is very large (100 GB).

Actually, the main idea behind distributed systems, which is of course what Hadoop and Spark are designed and implemented around, is to send the processing to the data. In other words, imagine that some data is located on the HDFS data nodes of our cluster and we have a job that uses that data on the same workers. On each machine you would typically run a data node and a Spark worker at the same time, and possibly other processes such as an HBase region server. When an executor runs one of the scheduled tasks, it retrieves the data that task needs from the underlying data node. So each individual task fetches its own data, and you can describe this as one connection to HDFS on its local data node.
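Since each task reads one input split (typically one HDFS block), a rough back-of-the-envelope sketch is possible: the number of tasks, and hence the approximate number of short-lived, mostly node-local reads from HDFS, is the file size divided by the block size. This assumes the default 128 MB HDFS block size and default partitioning; the exact numbers depend on your cluster configuration.

```python
import math

def estimate_splits(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Estimate the number of input splits (and hence Spark tasks) for an
    HDFS file. Each task reads its split, usually from the data node on
    the same machine, so this also approximates the number of per-task
    local connections to HDFS."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 100 GB file with the default 128 MB block size:
print(estimate_splits(100 * 1024**3))  # -> 800 splits, i.e. ~800 task-local reads
```

Note that these 800 reads do not happen simultaneously: only as many run at once as there are active executor cores, and each task's connection is closed when the task finishes.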
