Spark SQL: Cache Memory footprint improves with 'order by'
I have two scenarios where I have 23 GB of partitioned parquet data. I read a few of the columns and cache the result upfront, to fire a series of subsequent queries later on.
Setup:
Case 1:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count
From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.
Case 2:
val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")
dfMain.cache.count
From the Spark UI, the input data read is 6.2 GB and the cached object is 5.5 GB.
Is there any explanation, or code reference, for this behavior?
It is actually relatively simple. As you can read in the SQL guide:
Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression
The nice thing about sorted columnar storage is that it is very easy to compress on typical data. When you sort, you get blocks of similar records which can be squashed together using even very simple techniques like RLE (run-length encoding).
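To make the idea concrete, here is a minimal, Spark-free sketch of run-length encoding (the helper names `rleEncode`/`rleDecode` are hypothetical illustrations, not Spark's internal implementation). Each run of equal consecutive values collapses into a single (value, count) pair, so long runs of identical values compress to almost nothing:

```scala
// Collapse consecutive equal values into (value, count) pairs.
def rleEncode[A](xs: Seq[A]): Seq[(A, Int)] =
  if (xs.isEmpty) Seq.empty
  else xs.tail.foldLeft(List((xs.head, 1))) {
    case ((v, n) :: rest, x) if x == v => (v, n + 1) :: rest // extend current run
    case (acc, x)                      => (x, 1) :: acc      // start a new run
  }.reverse

// Expand (value, count) pairs back into the original sequence.
def rleDecode[A](runs: Seq[(A, Int)]): Seq[A] =
  runs.flatMap { case (v, n) => Seq.fill(n)(v) }
```

For example, `rleEncode(Seq("a", "a", "a", "b", "b"))` yields `Seq(("a", 3), ("b", 2))`: five values stored as two pairs. The more sorted (and therefore clustered) the input, the fewer pairs are needed.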
This property is actually used quite often in databases with columnar storage, because it is not only very efficient in terms of storage but also for aggregations.
Different aspects of Spark columnar compression are covered by the sql.execution.columnar.compression package and, as you can see there, RunLengthEncoding is indeed one of the available compression schemes.
So there are two pieces here:
Spark can adjust the compression method on the fly based on statistics:
Spark SQL will automatically select a compression codec for each column based on statistics of the data.
sorting can cluster similar records together, making compression much more efficient.
If there are correlations between columns (and when is that not the case?), even a simple sort based on a single column can have a relatively large impact and improve the performance of different compression schemes.
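A quick way to see why sorting helps is to count the runs RLE would produce before and after sorting; fewer runs means fewer (value, count) pairs to store. This is a hypothetical illustration (the `runCount` helper is not part of Spark):

```scala
// Number of runs of consecutive equal values, i.e. how many
// (value, count) pairs a run-length encoder would emit.
def runCount[A](xs: Seq[A]): Int =
  if (xs.isEmpty) 0
  else 1 + xs.sliding(2).count {
    case Seq(a, b) => a != b // a boundary between two runs
    case _         => false  // single-element window (length-1 input)
  }

val unsorted = Seq(3, 1, 3, 2, 1, 2, 3, 1) // every neighbor differs: 8 runs
val clustered = unsorted.sorted            // Seq(1,1,1,2,2,3,3,3): 3 runs
```

The same eight values need eight pairs when left unsorted but only three once sorted, which mirrors why the `order by pogId` variant caches in 5.5 GB instead of 15.1 GB: the sort clusters equal values into long, highly compressible runs.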