If a file is loaded from HDFS, Spark by default creates one partition per block. But how does Spark decide on partitions when a file is loaded from an S3 bucket?
See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits(). The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it equal to 0 (and then FileInputFormat.computeSplitSize() comes into play).
Also, you don't get splits at all if your InputFormat is not splittable :)
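For context, the rule in FileInputFormat.computeSplitSize() boils down to max(minSize, min(goalSize, blockSize)). Below is a minimal Scala sketch of that formula with purely illustrative numbers, not tied to any particular cluster configuration:
// Sketch of the rule in org.apache.hadoop.mapred.FileInputFormat.computeSplitSize():
// splitSize = max(minSize, min(goalSize, blockSize))
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

val mb = 1024L * 1024
// A 128 MB block size caps the 256 MB goal size:
println(computeSplitSize(goalSize = 256 * mb, minSize = 1, blockSize = 128 * mb)) // 134217728
// With a reported block size of 0, the formula degenerates to minSize:
println(computeSplitSize(goalSize = 256 * mb, minSize = 1, blockSize = 0)) // 1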
By default, Spark will create partitions of 64 MB when reading from S3. So a 100 MB file will be split into 2 partitions of 64 MB and 36 MB. An object of 64 MB or less won't be split at all.
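As a quick sanity check of that arithmetic, here is a hypothetical helper that mimics the behaviour described above; the 64 MB figure is the one quoted in this answer, and the actual split size depends on your Hadoop/S3 connector configuration (e.g. fs.s3a.block.size):
// Hypothetical helper: split an object of `size` bytes into chunks of at most `splitSize` bytes.
def splitSizes(size: Long, splitSize: Long): Seq[Long] = {
  val full = size / splitSize
  val rest = size % splitSize
  Seq.fill(full.toInt)(splitSize) ++ (if (rest > 0) Seq(rest) else Nil)
}

val mb = 1024L * 1024
println(splitSizes(100 * mb, 64 * mb).map(_ / mb)) // List(64, 36)
println(splitSizes(60 * mb, 64 * mb).map(_ / mb))  // List(60), i.e. not split at all
This is a simplification: the real FileInputFormat lets the last split run roughly 10% over splitSize (its SPLIT_SLOP factor), but for the 100 MB example the result is the same 64 MB + 36 MB.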
Spark treats S3 as if it were a block-based filesystem, so the partitioning rules for HDFS and S3 inputs are the same: by default you get one partition per block. It is worth inspecting the number of created partitions yourself:
val inputRDD = sc.textFile("s3a://...")
println(inputRDD.partitions.length)
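If the default number of partitions is too coarse for your job, you can ask for more up front via textFile's minPartitions argument, or repartition after loading. A small sketch reusing the placeholder path from above:
// Hint a minimum number of partitions at read time (it is a lower-bound hint; the actual count can be higher).
val finerRDD = sc.textFile("s3a://...", minPartitions = 32)
println(finerRDD.partitions.length)

// Or reshuffle an already-loaded RDD into a fixed number of partitions.
println(inputRDD.repartition(32).partitions.length)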
For further reading I suggest this, which covers partitioning rules in detail.