
Difference between Spark RDDs and HDFS' data blocks

Please help me understand the difference between HDFS data blocks and RDDs in Spark. HDFS distributes a dataset across multiple nodes in a cluster as equally sized blocks, and each data block is replicated multiple times and stored. RDDs are created as parallelized collections. Are the elements of a parallelized collection distributed across nodes, or are they stored in memory for processing? Is there any relation to HDFS' data blocks?
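For context, "parallelized collection" refers to what `sc.parallelize(data, numSlices)` produces: the driver's local collection is cut into `numSlices` partitions, and those partitions (not HDFS blocks) are what get distributed to executors. A rough pure-Python sketch of the slicing (the helper name `slice_collection` is made up for illustration; Spark's real logic lives in `ParallelCollectionRDD`):

```python
def slice_collection(data, num_slices):
    """Cut a local collection into num_slices contiguous partitions,
    mirroring how sc.parallelize assigns elements to partitions."""
    n = len(data)
    return [
        data[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

partitions = slice_collection(list(range(10)), 3)
print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each inner list here stands for one RDD partition that an executor would process; no HDFS block is involved at any point.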

Is there any relation to HDFS' data blocks?

In general, no. They address different problems:

  • RDDs are about distributing computation and handling computation failures.
  • HDFS is about distributing storage and handling storage failures.

Distribution is the common denominator, but that is it; the failure-handling strategies are obviously different (DAG re-computation and replication, respectively).
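A toy contrast of the two recovery strategies (the variable names and structure are illustrative only, not Spark or HDFS APIs): an RDD recovers a lost partition by re-running the transformation recorded in its lineage, while HDFS recovers a lost block by reading one of its surviving replicas.

```python
# RDD-style recovery: re-run the DAG step from the parent data.
parent = [1, 2, 3, 4]
transform = lambda x: x * 10
derived = [transform(x) for x in parent]   # computed partition

derived = None                             # simulate losing the partition
recovered = [transform(x) for x in parent] # recompute from lineage
print(recovered)  # [10, 20, 30, 40]

# HDFS-style recovery: read any surviving replica of the block.
replicas = {"node1": b"blockdata", "node2": b"blockdata", "node3": b"blockdata"}
del replicas["node1"]                      # simulate node failure
restored = next(iter(replicas.values()))   # serve from another replica
print(restored)   # b'blockdata'
```

Note the trade-off this models: lineage recovery costs CPU time at failure but no extra storage up front, while replication costs storage up front but makes recovery a simple read.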

Spark can use Hadoop input formats and read data from HDFS. In that case there will be a relationship between HDFS blocks and Spark splits. However, Spark doesn't require HDFS, and many components of the newer API don't use Hadoop input formats anymore.
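To see where that relationship comes from: with default settings, Hadoop's `FileInputFormat` produces roughly one input split per HDFS block, so a Spark job reading the file gets about one partition per block. A simplified sketch (the helper `block_offsets` is hypothetical; it ignores split-size tuning and record boundaries):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size, 128 MB

def block_offsets(file_size, block_size=BLOCK_SIZE):
    """Return (start, length) for each HDFS block a file occupies;
    under default settings each block becomes one input split."""
    return [
        (start, min(block_size, file_size - start))
        for start in range(0, file_size, block_size)
    ]

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB,
# so Spark would typically read it as three partitions.
splits = block_offsets(300 * 1024 * 1024)
print(len(splits))  # 3
```

The alignment is an optimization (a split that matches a block can be read from a node holding that block locally), not a requirement: split sizes are configurable, and sources that bypass Hadoop input formats partition data on their own terms.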

