
Does Presto cache intermediate results internally out of the box?

Original post: 2017-03-09 22:35:20; tags: presto, amazon-athena

Presto has multiple connectors. While the connectors do implement read and write operations, from all the tutorials I have read, it seems they are typically used as data sources to read from only. For example, Netflix has "10 petabytes" of data on Amazon S3, and they explicitly state that no disk (and no HDFS) is used on the Presto worker nodes. The stated use case is "ad hoc interactive" queries.

Also, Amazon Athena is essentially S3+Presto and comes with similar use cases.

I'm puzzled how this can work in practice. Obviously, you don't want to read 10 PB of data on every query. So I assume you want to keep some previously fetched data in memory, such as a database index. However, with no constraints on the data and the queries, I fail to understand how this can be efficient.

Use case 1: I run the same query frequently, e.g. to show a metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

Use case 2: I'm analysing a large data set. Each query is slightly different, but there are common subqueries, or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

Or, if this is not the case, would I be well advised to store intermediate results somewhere (e.g. CREATE TABLE AS...)?

2 answers

There is no data caching tier in Presto itself. To be honest, I don't think the features you are proposing here are supposed to be provided by Presto as a SQL analytics engine. For both of the use cases you mentioned, I suggest deploying Alluxio together with Presto as a caching layer to help:

Use case 1: I run the same query frequently, e.g. to show a metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

As a caching layer, Alluxio can detect the data access pattern from Presto (or other applications) and make caching/eviction decisions to serve the most frequently used data from a memory tier (depending on your configuration, this can be SSD or HDD too). This will help when the data access is not consistent.
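To make the caching/eviction idea concrete, here is a minimal sketch of a read-through cache with least-recently-used eviction. It is not Alluxio's actual policy or API; the class name `ReadThroughLRUCache`, the function `slow_remote_read`, and the block ids are all made up for illustration.

```python
from collections import OrderedDict


class ReadThroughLRUCache:
    """Tiny read-through cache with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # max number of blocks kept in the cache
        self.fetch = fetch            # callable that reads a block from the slow store
        self.blocks = OrderedDict()   # block id -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch(block_id)             # slow path: go to the remote store
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return data


def slow_remote_read(block_id):
    # Hypothetical stand-in for a remote object-store read.
    return f"data-for-{block_id}"


cache = ReadThroughLRUCache(capacity=2, fetch=slow_remote_read)
for block in ["a", "b", "a", "c", "a", "b"]:
    cache.read(block)
print(cache.hits, cache.misses)   # 2 hits, 4 misses for this access sequence
```

Repeated reads of block "a" are served from the cache, while "b" is evicted once "c" arrives and has to be fetched again; a real caching tier makes the same kind of trade-off at a much larger scale.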

Use case 2: I'm analysing a large data set. Each query is slightly different, but there are common subqueries, or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

With more knowledge of your input data, you can enforce data policies in Alluxio to (1) preload data (common subqueries) into the caching space, (2) set a TTL to retire data from the Alluxio caching space and make room for other hot data, and (3) set caching policies on certain input paths (e.g., CACHE on certain paths, NO CACHE on some other paths).
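The three policies can be sketched in a few lines. This is a toy model under my own assumptions, not Alluxio's configuration or API: `PathPolicyCache`, the rule tuples, and the `/warehouse/...` paths are all hypothetical, and the clock is injected so the TTL behaviour is easy to follow.

```python
class PathPolicyCache:
    """Toy cache with per-prefix CACHE / NO_CACHE rules and a TTL (illustrative only)."""

    def __init__(self, rules, clock):
        self.rules = rules      # list of (path_prefix, cacheable, ttl_seconds)
        self.clock = clock      # callable returning the current time in seconds
        self.entries = {}       # path -> (data, expiry_time)

    def _rule_for(self, path):
        # Longest matching prefix wins; paths without a matching rule are not cached.
        matches = [rule for rule in self.rules if path.startswith(rule[0])]
        return max(matches, key=lambda rule: len(rule[0])) if matches else None

    def put(self, path, data):
        rule = self._rule_for(path)
        if rule is None or not rule[1]:
            return False        # NO_CACHE path or no rule: skip caching
        self.entries[path] = (data, self.clock() + rule[2])
        return True

    def get(self, path):
        entry = self.entries.get(path)
        if entry is None:
            return None
        data, expiry = entry
        if self.clock() >= expiry:
            del self.entries[path]   # TTL elapsed: retire the entry
            return None
        return data

    def preload(self, files):
        # Warm the cache before the first query; files maps path -> data.
        return [path for path, data in files.items() if self.put(path, data)]


now = [0.0]   # fake clock so the example is deterministic
cache = PathPolicyCache(
    rules=[("/warehouse/hot/", True, 60), ("/warehouse/hot/tmp/", False, 0)],
    clock=lambda: now[0],
)
loaded = cache.preload({
    "/warehouse/hot/part-0": "rows-0",
    "/warehouse/hot/tmp/scratch": "junk",
    "/warehouse/cold/part-9": "rows-9",
})
fresh = cache.get("/warehouse/hot/part-0")     # still within the 60 s TTL
now[0] = 61.0
expired = cache.get("/warehouse/hot/part-0")   # TTL elapsed, entry retired
print(loaded, fresh, expired)
```

Only the file under the CACHE prefix is preloaded, and it disappears again once its TTL has passed, which frees room for other hot data.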

Check out more tips for running the Presto/Alluxio stack: https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1/

As far as I know, there is no intermediate implicit caching layer. When you use HDFS on your cluster, you surely benefit from OS disk caches, so the next query run will be faster, but you won't get instant cached results. Similar block-level data caching might apply to S3 too.

Generally, no reasonably sized system can sift through 10 petabytes of data, since reading all that data would take a lot of time. However, data can be partitioned so that Presto knows more or less which pieces of data need to be scanned. When partitioning aligns with query conditions (e.g. you partition data by date and you query for the most recent data), this can work really well.
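The effect of partitioning can be illustrated with a small model. This is only a sketch of the idea, not how Presto implements partition pruning; the `partitions` layout, the `scan` function, and the sample rows are invented for the example.

```python
from datetime import date

# Hypothetical table laid out as one partition per day: date -> list of rows.
partitions = {
    date(2017, 3, 7): [{"metric": 10}, {"metric": 12}],
    date(2017, 3, 8): [{"metric": 7}],
    date(2017, 3, 9): [{"metric": 3}, {"metric": 5}, {"metric": 8}],
}


def scan(partitions, min_date):
    """Sum `metric` for rows on or after min_date, reading only matching partitions."""
    total = 0
    rows_read = 0
    for day, rows in partitions.items():
        if day < min_date:
            continue                  # pruned: this partition is never opened
        for row in rows:
            rows_read += 1
            total += row["metric"]
    return total, rows_read


total, rows_read = scan(partitions, min_date=date(2017, 3, 9))
print(total, rows_read)   # 16 3: only the most recent partition was read
```

Because the filter is on the partitioning key, the query touches 3 of the 6 rows; a filter on any other column would have to read every partition.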

When your data is not partitioned the same way you query it, and you don't want to re-partition it, saving temporary results with create table ... as select makes a lot of sense. You can also store such temporary tables in some in-memory storage, e.g. the raptor (currently undocumented) or memory connectors, for even faster access.
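As a rough illustration of the create table ... as select pattern, the sketch below uses SQLite from Python's standard library as a stand-in, since it runs without a cluster. It is not Presto, and the `events` and `events_de` tables and their columns are hypothetical; in Presto the same statement shape would run against a catalog that supports writes.

```python
import sqlite3

# SQLite stands in for the query engine here; this is not a Presto connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, country TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-03-08", "DE", 5),
        ("2017-03-09", "DE", 7),
        ("2017-03-09", "US", 11),
        ("2017-03-09", "DE", 2),
    ],
)

# Materialize the common subset once ...
conn.execute(
    "CREATE TABLE events_de AS "
    "SELECT day, amount FROM events WHERE country = 'DE'"
)

# ... then run the slightly different follow-up queries against the smaller table.
total = conn.execute("SELECT SUM(amount) FROM events_de").fetchone()[0]
per_day = conn.execute(
    "SELECT day, SUM(amount) FROM events_de GROUP BY day ORDER BY day"
).fetchall()
print(total, per_day)
```

The filter on `country` is paid for once when the intermediate table is created; every later query reads only the materialized subset.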

For some starter tips on partitioning and on tuning storage and queries, you can have a look at https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ .

A common optimization to improve Presto query latency is to cache the working set to avoid unnecessary I/O from remote data sources or through a slow network. This section describes the following options to leverage Alluxio as a caching layer for Presto.

Option 1: Alluxio File System serves the Presto Hive Connector as an independent distributed caching file system on top of HDFS or object stores like AWS S3, GCP, and Azure blob store. Users can understand the cache usage and control the cache explicitly through a file system interface. For example, one can preload all files in an Alluxio directory to warm the cache for Presto queries, and set the TTL (time-to-live) for cached data to reclaim cache capacity.

Option 2: Alluxio Structured Data Service interacts with Presto with both a catalog and a caching file system based on Option 1. This option provides additional benefits on top of Option 1: seamless access to existing Hive tables without modifying table locations in the Hive Metastore, and further performance optimization by consolidating many small files or transforming the formats of input files.

Source: Presto Alluxio Cache Service