
How many partitions does Spark create when a file is loaded from S3 bucket?

If the file is loaded from HDFS, Spark by default creates one partition per block. But how does Spark decide on partitions when a file is loaded from an S3 bucket?

See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits().

The block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). For example, S3AFileStatus just sets it to 0 (and then FileInputFormat.computeSplitSize() comes into play).
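To make that interplay concrete, here is a minimal Scala paraphrase of the split-size formula in org.apache.hadoop.mapred.FileInputFormat (the real implementation is Java; the example values below are only assumptions for illustration):

def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// goalSize is roughly totalSize / requested number of splits;
// minSize comes from mapreduce.input.fileinputformat.split.minsize.
// With an ordinary 128 MB block, the block size caps the split:
println(computeSplitSize(goalSize = 256L << 20, minSize = 1L, blockSize = 128L << 20)) // 128 MB
// With blockSize = 0, the goal/min sizes drive the result instead:
println(computeSplitSize(goalSize = 256L << 20, minSize = 1L, blockSize = 0L)) // minSize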

Also, you don't get splits if your InputFormat is not splittable :)
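As a quick, hedged illustration of that point (the bucket and key names are placeholders): gzip-compressed text is not splittable, so each .gz file lands in a single partition regardless of its size.

val plainRDD   = sc.textFile("s3a://my-bucket/logs/events.txt")    // may be split into several partitions
val gzippedRDD = sc.textFile("s3a://my-bucket/logs/events.txt.gz") // not splittable: one partition per file
println(plainRDD.partitions.length)
println(gzippedRDD.partitions.length) // expect 1 for a single gzip file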

By default Spark will create partitions of size 64 MB when reading from S3. So a 100 MB file will be split into 2 partitions, 64 MB and 36 MB. An object whose size is less than or equal to 64 MB won't be split at all.
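If the default split size doesn't suit you, here is a minimal sketch of two common knobs for influencing the partition count (the path and values are placeholder assumptions): pass a minPartitions hint to textFile, or change the block size that the S3A filesystem reports via fs.s3a.block.size before it is first used (a smaller reported block size yields more, smaller partitions).

// Ask for at least 8 partitions when reading:
val rdd = sc.textFile("s3a://my-bucket/data/file.csv", 8)
// Or change the block size S3A reports (in bytes) before the filesystem is created:
sc.hadoopConfiguration.setLong("fs.s3a.block.size", 32L * 1024 * 1024)
println(rdd.partitions.length)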

Spark will treat S3 as if it were a block-based filesystem, so the partitioning rules for HDFS and S3 inputs are the same: by default you get one partition per block. It is worth inspecting the number of created partitions yourself:

val inputRDD = sc.textFile("s3a://...")
println(inputRDD.partitions.length)

For further reading I suggest this, which covers partitioning rules in detail.
