Spark SQL: Cache Memory footprint improves with 'order by'

I have two scenarios where I have 23 GB of partitioned parquet data, reading a few of the columns and caching them upfront to fire a series of subsequent queries later on.

Setup:

  • Cluster: 12 Node EMR
  • Spark Version: 1.6
  • Spark Configurations: Default
  • Run Configurations: Same for both cases

Case 1:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase")
dfMain.cache.count

From the Spark UI, the input data read is 6.2 GB and the cached object is 15.1 GB.

Case 2:

val paths = Array("s3://my/parquet/path", ...)
val parqFile = sqlContext.read.parquet(paths:_*)
parqFile.registerTempTable("productViewBase")
val dfMain = sqlContext.sql("select guid,email,eventKey,timestamp,pogId from productViewBase order by pogId")
dfMain.cache.count

From the Spark UI, the input data read is 6.2 GB and the cached object is 5.5 GB.
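
A hedged way to cross-check these Spark UI numbers from code (getRDDStorageInfo is a developer API on SparkContext that exists in 1.6; this snippet is not part of the original setup):

// Print the in-memory size of every cached RDD, including the one backing dfMain.cache
sc.getRDDStorageInfo.foreach { info =>
  println(f"${info.name}: ${info.memSize / (1024.0 * 1024 * 1024)}%.1f GB in memory")
}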

Any explanation, or code reference, for this behavior?

It is actually relatively simple. As you can read in the SQL guide:

Spark SQL can cache tables using an in-memory columnar format ... Spark SQL will scan only required columns and will automatically tune compression
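
The settings behind that sentence are real Spark SQL options; a minimal sketch with the 1.6 defaults, shown only to make the quoted guide text concrete:

// When true, Spark SQL selects a compression codec per column based on statistics of the data (default: true)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
// Size, in rows, of the batches used for columnar caching (default: 10000)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")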

The nice thing about sorted columnar storage is that it is very easy to compress on typical data. When you sort, you get blocks of similar records which can be squashed together using even very simple techniques like RLE (run-length encoding).
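
A minimal pure-Scala sketch of the idea (not Spark's internal implementation): run-length encoding stores (value, run length) pairs, so a sorted column with long runs of equal values collapses far more than a shuffled one.

// Encode a sequence as (value, run length) pairs
def rle[T](xs: Seq[T]): Seq[(T, Int)] =
  xs.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

val pogIds = Seq(3, 1, 2, 3, 1, 2, 3, 1, 2)
rle(pogIds)        // 9 runs of length 1 -- no savings
rle(pogIds.sorted) // 3 runs: (1,3), (2,3), (3,3) -- a third of the entries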

This is a property that is actually used quite often in databases with columnar storage, because it is not only very efficient in terms of storage but also for aggregations.

Different aspects of Spark columnar compression are covered by the sql.execution.columnar.compression package and, as you can see, RunLengthEncoding is indeed one of the available compression schemes.

So there are two pieces here:

  • Spark can adjust the compression method on the fly based on statistics:

    Spark SQL will automatically select a compression codec for each column based on statistics of the data.

  • sorting can cluster similar records together, making compression much more efficient.

If there are some correlations between columns (and when is that not the case?), even a simple sort based on a single column can have a relatively large impact and improve the performance of different compression schemes.
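
A hedged sketch of how this could be verified on the data from the question (Spark 1.6 API; table and column names come from the question itself): cache the same projection twice, once as-is and once sorted, and compare the two entries on the Storage tab of the Spark UI.

// Same projection cached twice; only the second is sorted by pogId
val base = sqlContext.sql(
  "select guid,email,eventKey,timestamp,pogId from productViewBase")

val plainDF  = base
val sortedDF = base.sort("pogId")   // equivalent to the ORDER BY in Case 2

plainDF.cache.count    // larger "Size in Memory" expected
sortedDF.cache.count   // smaller, thanks to RLE-friendly runs within each column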
