
Apache Spark - shuffle writes more data than the size of the input data

I use Spark 2.1 in local mode and I'm running this simple application:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val sparkSession = SparkSession.builder().master("local[*]").getOrCreate()  // local mode

val N = 10 << 20

sparkSession.conf.set("spark.sql.shuffle.partitions", "5")
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", (N + 1).toString)
sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")

val df1 = sparkSession.range(N).selectExpr("id as k1")
val df2 = sparkSession.range(N / 5).selectExpr("id * 3 as k2")

df1.join(df2, col("k1") === col("k2")).count()

Here, range(N) creates a dataset of Long values (all unique), so I assume the sizes are roughly (see the quick check after the list):

  • df1 = N * 8 bytes ~ 80MB
  • df2 = N / 5 * 8 bytes ~ 16MB
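
For reference, the arithmetic behind those estimates (8 bytes per Long; plain Scala, not part of the original app):

val rows = 10 << 20                   // 10,485,760 rows
val df1Bytes = rows.toLong * 8        // 83,886,080 bytes ≈ 80MB
val df2Bytes = (rows / 5).toLong * 8  // 16,777,216 bytes ≈ 16MB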

OK, now let's take df1 as an example. df1 consists of 8 partitions and its shuffled RDD of 5 partitions, so I assume that (both numbers can be checked as shown after the list):

  • # of mappers (M) = 8
  • # of reducers (R) = 5
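
Both numbers can be inspected with public APIs (a quick check; the value 8 assumes local mode with 8 cores):

println(df1.rdd.getNumPartitions)                              // 8 mapper-side partitions
println(sparkSession.conf.get("spark.sql.shuffle.partitions")) // 5 reducer-side partitions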

As the number of partitions is low, Spark will use the Hash Shuffle, which creates M * R files on disk, but I haven't understood whether every file holds all the data (i.e. each_file_size = data_size, for a total of M * R * data_size) or whether all the files together add up to data_size.

However, when executing this app, the shuffle write of df1 is 160MB, which doesn't match either of the above cases.

[Spark UI screenshot]
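
Besides the UI, the same metric can be observed programmatically (a sketch using the public SparkListener API; this listener is not from the original post and sums shuffle writes across all stages):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sum the shuffle bytes written by every task (the same metric the UI shows).
var shuffleBytesWritten = 0L
sparkSession.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) shuffleBytesWritten += metrics.shuffleWriteMetrics.bytesWritten
  }
})
// run the join, then inspect shuffleBytesWritten (total over both sides of the join)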

What am I missing here? Why has the shuffle write data doubled in size?

First of all, let's see what data size total (min, med, max) means:

According to SQLMetrics.scala#L88 and ShuffleExchange.scala#L43, the data size total (min, med, max) we see is the final value of the dataSize metric of the shuffle. Then, how is it updated? It gets updated each time a record is serialized (UnsafeRowSerializer.scala#L66), via dataSize.add(row.getSizeInBytes) (UnsafeRow is the internal representation of records in Spark SQL).
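
In other words, the metric sums the serialized byte length of every UnsafeRow that goes through the shuffle. A minimal sketch of that idea (not Spark's actual serializer code; a plain Long stands in for the SQLMetric):

import java.io.OutputStream
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Serialize rows to the shuffle output and accumulate their byte lengths.
def writeRows(rows: Iterator[UnsafeRow], out: OutputStream): Long = {
  var dataSize = 0L                      // stands in for the dataSize metric
  val writeBuffer = new Array[Byte](4096)
  rows.foreach { row =>
    dataSize += row.getSizeInBytes       // the metric update from UnsafeRowSerializer
    row.writeToStream(out, writeBuffer)  // the backing byte[] is copied as-is
  }
  dataSize
}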

Internally, an UnsafeRow is backed by a byte[], and that byte[] is copied directly to the underlying output stream during serialization; its getSizeInBytes() method just returns the length of the byte[]. Therefore, the initial question becomes: why is the byte representation twice as big as the single long column a record has? The UnsafeRow.scala doc gives us the answer:

Each tuple has three parts: [null bit set] [values] [variable length portion]

The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores one bit per field.

Since it is 8-byte word aligned, the single null bit takes up another 8 bytes, the same width as the long column. Therefore, each UnsafeRow represents your one-long-column row using 16 bytes. With N = 10 << 20 = 10,485,760 rows, that is 167,772,160 bytes, i.e. exactly the 160MB of shuffle write you observed for df1.
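
You can verify the 16 bytes directly with Catalyst's internal (and therefore unstable) API, for example in a spark-shell:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, LongType}

// Convert a one-long-column row into its UnsafeRow representation.
val toUnsafe = UnsafeProjection.create(Array[DataType](LongType))
val row = toUnsafe(InternalRow(42L))
println(row.getSizeInBytes)  // 16 = 8-byte null bitset word + 8-byte long value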
