
Big input data for Spark job

I have 1800 *.gz files under a folder named input. Each *.gz file is around 300 MB, and each file is around 3 GB after unzipping, so roughly 5400 GB (5.4 TB) in total when unzipped.

I can't have a cluster with 5400 GB of executor memory. Is it possible to read all the files under the input folder like below?

JavaRDD<String> lines = ctx.textFile("input");

So how much executor memory do I need for this job? How does Spark handle the situation when the data cannot all fit into memory?

Thanks!

Creating an RDD object pointing to a directory of text files does not by itself load any of the data set into memory. Data is only loaded into memory when you tell Spark to process it, and in many (most?) cases this still does not require the full data set to be in memory at the same time. How much memory you will need for your 5.4TB data set really depends on what you are going to do with it.
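As a minimal sketch of that point (class name, master/deploy settings and the "ERROR" filter are just illustrative; the Spark Java API calls are standard): textFile() only records the input path, and an action such as count() streams through the partitions one task at a time, so the 5.4 TB never has to be resident in memory at once.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GzipLineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("GzipLineCount");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        // Lazily points at all files under "input"; .gz files are decompressed on read.
        // No data is loaded here, only the lineage is recorded.
        JavaRDD<String> lines = ctx.textFile("input");

        // The action triggers the actual read. Each task streams line by line through
        // its partition, so only a partition's worth of work is in flight per task.
        long errorLines = lines.filter(line -> line.contains("ERROR")).count();
        System.out.println("Lines containing ERROR: " + errorLines);

        ctx.stop();
    }
}

One caveat worth knowing: gzip files are not splittable, so each .gz file becomes a single partition, but a simple filter/count still streams through that partition rather than holding all 3 GB in memory.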

That said, there are options for how an RDD is persisted once it is loaded. By default Spark persists data only in memory (MEMORY_ONLY), but there are also storage levels that spill to disk when there is not enough memory available. There is a good write-up of this in the Spark programming guide.
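For example, a sketch of choosing a disk-spilling storage level (MEMORY_AND_DISK is a standard Spark storage level; the class name and the two example actions are just for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistWithSpill {
    public static void main(String[] args) {
        JavaSparkContext ctx =
                new JavaSparkContext(new SparkConf().setAppName("PersistWithSpill"));
        JavaRDD<String> lines = ctx.textFile("input");

        // MEMORY_AND_DISK keeps partitions in memory while they fit and writes the
        // rest to local disk instead of dropping and recomputing them.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        long total = lines.count();                              // first action populates the cache
        long nonEmpty = lines.filter(s -> !s.isEmpty()).count(); // reuses cached partitions

        System.out.println(total + " total lines, " + nonEmpty + " non-empty");
        ctx.stop();
    }
}

Persisting only pays off if the RDD is reused across several actions; for a single pass over the data there is no need to persist at all.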
