I use Spark 2.1 in local mode and I'm running this simple application:
val N = 10 << 20
sparkSession.conf.set("spark.sql.shuffle.partitions", "5")
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", (N + 1).toString)
sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")
val df1 = sparkSession.range(N).selectExpr(s"id as k1")
val df2 = sparkSession.range(N / 5).selectExpr(s"id * 3 as k2")
df1.join(df2, col("k1") === col("k2")).count()
Here, the range(N) creates a dataset of Long (with unique values), so I assume that the size of
- df1 = N * 8 bytes ~ 80MB
- df2 = N / 5 * 8 bytes ~ 16MB
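The arithmetic behind those estimates can be sketched as follows (a back-of-envelope check assuming 8 bytes per Long value and no per-row overhead):

```scala
// Back-of-envelope size estimates: 8 bytes per Long, nothing else.
val N = 10 << 20                      // 10,485,760 rows
val df1Bytes = N.toLong * 8L          // 83,886,080 bytes
val df2Bytes = (N / 5).toLong * 8L    // 16,777,216 bytes
println(df1Bytes / (1024 * 1024))     // 80 (MB)
println(df2Bytes / (1024 * 1024))     // 16 (MB)
```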
Ok, now let's take df1 as an example. df1 consists of 8 partitions and its shuffled RDD has 5 partitions, so I assume that
- # of mappers (M) = 8
- # of reducers (R) = 5
As the # of partitions is low, Spark will use the Hash Shuffle, which creates M * R files on disk. What I haven't understood is whether every file holds all the data (so each_file_size = data_size, giving a total of M * R * data_size) or whether the files together add up to data_size.
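The second case can be illustrated with a toy model (this is not Spark's actual shuffle writer, just a sketch of hash partitioning): each mapper splits its own records into R buckets by key hash, so every record lands in exactly one of its R files, and all the files together sum to roughly the serialized data size.

```scala
// Toy model of hash partitioning: each record goes to exactly one of R
// buckets, so the bucket sizes sum to the input size (one mapper shown).
val records = (0L until 100L).toSeq
val R = 5
val buckets = records.groupBy(k => (k.hashCode % R + R) % R)
println(buckets.values.map(_.size).sum)  // 100: buckets partition the data
```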
However, when I execute this app, the shuffle write of df1 is 160MB, which doesn't match either of the above cases.
What am I missing here? Why has the shuffle write data doubled in size?
First of all, let's see what data size total (min, med, max) means:

According to SQLMetrics.scala#L88 and ShuffleExchange.scala#L43, the data size total (min, med, max) we see is the final value of the dataSize metric of the shuffle. Then, how is it updated? It gets updated each time a record is serialized, in UnsafeRowSerializer.scala#L66, by dataSize.add(row.getSizeInBytes) (UnsafeRow is the internal representation of records in Spark SQL).
Internally, an UnsafeRow is backed by a byte[] and is copied directly to the underlying output stream during serialization; its getSizeInBytes() method just returns the length of that byte[]. So the initial question becomes: why is the byte representation twice as big as the single long column the record has? The UnsafeRow.scala doc gives us the answer:
Each tuple has three parts: [null bit set] [values] [variable length portion]
The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores one bit per field.
Since the bit set is aligned to 8-byte words, the single null bit takes up another 8 bytes, the same width as the long column. Therefore, each UnsafeRow represents your one-long-column row using 16 bytes.
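That layout arithmetic can be sketched as follows (an assumption-labeled sketch of the fixed-width formula in Spark's UnsafeRow source: a null bit set rounded up to 8-byte words, plus one 8-byte slot per fixed-width field):

```scala
// Sketch of UnsafeRow's fixed-width size: the null bit set is rounded up
// to whole 8-byte words, then each field gets an 8-byte slot (variable-
// length data, if any, would follow after these slots).
def bitSetWidthInBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8
def fixedRowSize(numFields: Int): Int = bitSetWidthInBytes(numFields) + numFields * 8

println(fixedRowSize(1))  // 16: one long column costs 8 (bit set) + 8 (value)
```

And 10,485,760 rows × 16 bytes = 160MB, which matches the shuffle write you observed for df1.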