
Big input data for Spark job

I have 1,800 *.gz files under a folder named input. Each *.gz file is around 300 MB, and each decompresses to around 3 GB, so the whole data set is about 5,400 GB (5.4 TB) uncompressed.

I can't have a cluster with 5,400 GB of executor-memory. Is it possible to read all files under the input folder like below?

JavaRDD&lt;String&gt; lines = ctx.textFile("input");

So how much executor-memory do I need for this job? How does Spark handle the situation when the data cannot all fit into memory?

Thanks!

Creating an RDD object pointing to a directory of text files does not by itself load any of the data set into memory. Data is only loaded into memory when you tell Spark to process it, and in many (most?) cases this still does not require the full data set to be in memory at the same time. How much memory you will need for your 5.4 TB data set really depends on what you are going to do with it.
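
For example, here is a minimal sketch of a driver program that counts the lines in all of the *.gz files. It assumes Spark is on the classpath and the job is launched with spark-submit; the class name CountLines is illustrative, and "input" is the directory from the question:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountLines {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CountLines");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        // This only records the lineage; nothing is read or decompressed yet.
        JavaRDD<String> lines = ctx.textFile("input");

        // The action streams through the data one partition per task; only the
        // partitions currently being processed need to fit in executor memory,
        // not all 5.4 TB at once.
        long total = lines.count();
        System.out.println("total lines: " + total);

        ctx.stop();
    }
}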

That said, there are options for how an RDD can be persisted as it is loaded. By default Spark keeps the data only in memory, but there are also storage levels that spill to disk when no memory is available. There is a good write-up of this in the Spark programming guide.
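
If you plan to reuse the RDD across several actions, here is a hedged sketch of opting into a storage level that spills to disk. It builds on the ctx and lines from the sketch above and uses StorageLevels from the Java API; the exact level to pick depends on your workload:

import org.apache.spark.api.java.StorageLevels;

// MEMORY_AND_DISK keeps the partitions that fit in memory and writes the
// remainder to local disk instead of dropping and recomputing them.
lines.persist(StorageLevels.MEMORY_AND_DISK);

long first = lines.count();    // first action materializes and caches the data
long second = lines.count();   // later actions reuse the cached/spilled partitions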

