
Big input data for Spark job

I have 1800 *.gz files under a folder named input. Each *.gz file is around 300 MB, and each file is around 3 GB after unzipping, so roughly 5400 GB (5.4 TB) in total when unzipped.

I can't have a cluster with 5400 GB of executor memory. Is it possible to read all the files under the input folder like below?

JavaRDD<String> lines = ctx.textFile("input");

So how much executor memory do I need for this job? How does Spark handle the situation when the data cannot all fit into memory?

Thanks!

Creating an RDD object pointing to a directory of text files does not by itself load any of the data set into memory. Data is only loaded into memory when you tell Spark to process it, and in many (most?) cases this still does not require the full data set to be in memory at the same time. How much memory you will need for your 5.4TB data set really depends on what you are going to do with it.
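As a minimal sketch of that point (class name, master/deploy settings and the "ERROR" filter are just illustrative; the Spark Java API calls are standard): textFile() only records the input path, and an action such as count() streams through the partitions one task at a time, so the 5.4 TB never has to be resident in memory at once.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GzipLineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GzipLineCount");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        // Lazily points at all files under "input"; .gz files are decompressed on read.
        // No data is loaded here, only the lineage is recorded.
        JavaRDD<String> lines = ctx.textFile("input");

        // The action triggers the actual read. Each task streams line by line through
        // its partition, so only a partition's worth of work is in flight per task.
        long errorLines = lines.filter(line -> line.contains("ERROR")).count();
        System.out.println("Lines containing ERROR: " + errorLines);

        ctx.stop();
    }
}

One caveat worth knowing: gzip files are not splittable, so each .gz file becomes a single partition, but a simple filter/count still streams through that partition rather than holding all 3 GB in memory.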

That said, there are options for how an RDD is persisted once it is loaded. By default Spark persists data only in memory (MEMORY_ONLY), but there are also storage levels that spill to disk when there is not enough memory available. There is a good write-up of this in the Spark programming guide.
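For example, a sketch of choosing a disk-spilling storage level (MEMORY_AND_DISK is a standard Spark storage level; the class name and the two example actions are just for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistWithSpill {
    public static void main(String[] args) {
        JavaSparkContext ctx =
                new JavaSparkContext(new SparkConf().setAppName("PersistWithSpill"));
        JavaRDD<String> lines = ctx.textFile("input");

        // MEMORY_AND_DISK keeps partitions in memory while they fit and writes the
        // rest to local disk instead of dropping and recomputing them.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        long total = lines.count();                              // first action populates the cache
        long nonEmpty = lines.filter(s -> !s.isEmpty()).count(); // reuses cached partitions

        System.out.println(total + " total lines, " + nonEmpty + " non-empty");
        ctx.stop();
    }
}

Persisting only pays off if the RDD is reused across several actions; for a single pass over the data there is no need to persist at all.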
