
Apache Spark - shuffle writes more data than the size of the input data

I use Spark 2.1 in local mode and I'm running this simple application:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val sparkSession = SparkSession.builder().master("local[*]").getOrCreate()  // local mode

val N = 10 << 20

sparkSession.conf.set("spark.sql.shuffle.partitions", "5")
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", (N + 1).toString)
sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")

val df1 = sparkSession.range(N).selectExpr("id as k1")
val df2 = sparkSession.range(N / 5).selectExpr("id * 3 as k2")

df1.join(df2, col("k1") === col("k2")).count()

Here, range(N) creates a dataset of Long values (all unique), so I assume the sizes are roughly (see the quick check after the list):

  • df1 = N * 8 bytes ~ 80MB
  • df2 = N / 5 * 8 bytes ~ 16MB
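
For reference, the arithmetic behind those estimates (8 bytes per Long; plain Scala, not part of the original app):

val rows = 10 << 20                   // 10,485,760 rows
val df1Bytes = rows.toLong * 8        // 83,886,080 bytes ≈ 80MB
val df2Bytes = (rows / 5).toLong * 8  // 16,777,216 bytes ≈ 16MB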

OK, now let's take df1 as an example. df1 consists of 8 partitions and its shuffled RDD of 5 partitions, so I assume that (both numbers can be checked as shown after the list):

  • # of mappers (M) = 8
  • # of reducers (R) = 5
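
Both numbers can be inspected with public APIs (a quick check; the value 8 assumes local mode with 8 cores):

println(df1.rdd.getNumPartitions)                              // 8 mapper-side partitions
println(sparkSession.conf.get("spark.sql.shuffle.partitions")) // 5 reducer-side partitions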

As the number of partitions is low, Spark will use the Hash Shuffle, which creates M * R files on disk, but I haven't understood whether every file holds all the data (i.e. each_file_size = data_size, for a total of M * R * data_size) or whether all the files together add up to data_size.

However, when executing this app, the shuffle write of df1 is 160MB, which doesn't match either of the above cases.

[Spark UI screenshot]
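
Besides the UI, the same metric can be observed programmatically (a sketch using the public SparkListener API; this listener is not from the original post and sums shuffle writes across all stages):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sum the shuffle bytes written by every task (the same metric the UI shows).
var shuffleBytesWritten = 0L
sparkSession.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) shuffleBytesWritten += metrics.shuffleWriteMetrics.bytesWritten
  }
})
// run the join, then inspect shuffleBytesWritten (total over both sides of the join)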

What am I missing here? Why has the shuffle write data doubled in size?

First of all, let's see what data size total (min, med, max) means:

According to SQLMetrics.scala#L88 and ShuffleExchange.scala#L43, the data size total (min, med, max) we see is the final value of the dataSize metric of the shuffle. Then, how is it updated? It gets updated each time a record is serialized (UnsafeRowSerializer.scala#L66), via dataSize.add(row.getSizeInBytes) (UnsafeRow is the internal representation of records in Spark SQL).
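
In other words, the metric sums the serialized byte length of every UnsafeRow that goes through the shuffle. A minimal sketch of that idea (not Spark's actual serializer code; a plain Long stands in for the SQLMetric):

import java.io.OutputStream
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Serialize rows to the shuffle output and accumulate their byte lengths.
def writeRows(rows: Iterator[UnsafeRow], out: OutputStream): Long = {
  var dataSize = 0L                      // stands in for the dataSize metric
  val writeBuffer = new Array[Byte](4096)
  rows.foreach { row =>
    dataSize += row.getSizeInBytes       // the metric update from UnsafeRowSerializer
    row.writeToStream(out, writeBuffer)  // the backing byte[] is copied as-is
  }
  dataSize
}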

Internally, an UnsafeRow is backed by a byte[], and that byte[] is copied directly to the underlying output stream during serialization; its getSizeInBytes() method just returns the length of the byte[]. Therefore, the initial question becomes: why is the byte representation twice as big as the single long column a record has? The UnsafeRow.scala doc gives us the answer:

Each tuple has three parts: [null bit set] [values] [variable length portion]

The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores one bit per field.

Since it is 8-byte word aligned, the single null bit takes up another 8 bytes, the same width as the long column. Therefore, each UnsafeRow represents your one-long-column row using 16 bytes. With N = 10 << 20 = 10,485,760 rows, that is 167,772,160 bytes, i.e. exactly the 160MB of shuffle write you observed for df1.
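
You can verify the 16 bytes directly with Catalyst's internal (and therefore unstable) API, for example in a spark-shell:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, LongType}

// Convert a one-long-column row into its UnsafeRow representation.
val toUnsafe = UnsafeProjection.create(Array[DataType](LongType))
val row = toUnsafe(InternalRow(42L))
println(row.getSizeInBytes)  // 16 = 8-byte null bitset word + 8-byte long value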
