
Does Presto cache intermediate results internally out of the box?

Presto has multiple connectors. While the connectors do implement read and write operations, from all the tutorials I read, it seems they are typically used only as data sources to read from. For example, Netflix has "10 petabytes" of data on Amazon S3 and they explicitly state that no disk (and no HDFS) is used on the Presto worker nodes. The stated use case is "ad hoc interactive" queries.

Also, Amazon Athena is essentially S3+Presto and comes with similar use cases.

I'm puzzled how this can work in practice. Obviously, you don't want to read 10 PB of data on every query. So I assume you want to keep some previously fetched data in memory, much like a database index. However, with no constraints on the data and the queries, I fail to understand how this can be efficient.

Use case 1: I run the same query frequently, e.g. to show a metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

Use case 2: I'm analysing a large data set. Each query is slightly different; however, there are common subqueries, or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

Or, if this is not the case, would I be well advised to store intermediate results somewhere (e.g. CREATE TABLE AS ...)?

There is no data caching tier in Presto itself. To be honest, I don't think the features you are proposing here are supposed to be provided by Presto as a SQL analytics engine. For both of the use cases you mentioned, I suggest deploying Alluxio together with Presto as a caching layer to help:

Use case 1: I run the same query frequently, e.g. to show a metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

As a caching layer, Alluxio can detect the data access pattern from Presto (or other applications) and make caching/eviction decisions to serve the most frequently used data from a memory tier (depending on your configuration, this can be SSD or HDD too). This helps when data access is skewed, i.e. some data is read much more often than the rest.

Use case 2: I'm analysing a large data set. Each query is slightly different; however, there are common subqueries, or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

With more knowledge of your input data, you can enforce data policies in Alluxio to (1) preload data for common subqueries into the caching space, (2) set a TTL to retire data from the Alluxio caching space and make room for other hot data, and (3) set caching policies on certain input paths (e.g. CACHE on certain paths, NO CACHE on others).

Check out more tips for running the Presto/Alluxio stack: https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1/

As far as I know there is no implicit intermediate caching layer. When you use HDFS on your cluster, you certainly benefit from the OS disk cache, so the next run of a query will be faster, but you won't get instant cached results. Similar block-level caching may apply to S3 too.

Generally, no reasonably sized system can sift through 10 petabytes of data, since reading all that data would take a very long time. However, data can be partitioned so that Presto knows more or less which pieces of data need to be scanned. When the partitioning aligns with query conditions (e.g. you partition data by date and you query for the most recent data), this can work really well.
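As a sketch of what that looks like with the Hive connector (the catalog, schema, table, and column names here are hypothetical), partitioning by a date column lets Presto prune everything outside the queried range:

```sql
-- Hypothetical table in a Hive-connector catalog named 'hive',
-- partitioned by event date (partition columns must come last).
CREATE TABLE hive.logs.events (
    user_id BIGINT,
    payload VARCHAR,
    event_date DATE
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['event_date']
);

-- A filter on the partition column lets Presto prune partitions,
-- so only the matching days' files are scanned, not the whole table.
SELECT count(*)
FROM hive.logs.events
WHERE event_date >= date '2020-01-01';
```

Without the `event_date` predicate, the same query would have to scan every partition, which is exactly the full-scan cost you want to avoid.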

When your data is not partitioned the way you query it, and you don't want to re-partition it differently, saving temporary results with CREATE TABLE ... AS SELECT makes a lot of sense. You can also store such temporary tables using some in-memory storage, e.g. the Raptor (currently undocumented) or Memory connectors, for even faster access.
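A minimal sketch of that pattern, assuming a Memory connector catalog named `memory` is configured on the cluster (table and column names are hypothetical):

```sql
-- Materialize a commonly reused subset once...
CREATE TABLE memory.default.recent_events AS
SELECT user_id, payload
FROM hive.logs.events
WHERE event_date >= date '2020-01-01';

-- ...then run subsequent ad hoc queries against the in-memory copy
-- instead of rescanning the remote source each time.
SELECT user_id, count(*) AS n
FROM memory.default.recent_events
GROUP BY user_id
ORDER BY n DESC
LIMIT 10;

-- Drop it when done, since it occupies memory on the workers.
DROP TABLE memory.default.recent_events;
```

Note the trade-off: the in-memory copy is a manual, point-in-time snapshot, so it must be refreshed by you when the underlying data changes.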

For some starter tips about partitioning, tuning storage and queries you can have a look at https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ .

A common optimization to improve Presto query latency is to cache the working set to avoid unnecessary I/O from remote data sources or over a slow network. This section describes the following options for leveraging Alluxio as a caching layer for Presto.

The Alluxio file system serves the Presto Hive connector as an independent distributed caching file system on top of HDFS or object stores like AWS S3, GCS, or Azure Blob Storage. Users can inspect cache usage and control the cache explicitly through a file system interface. For example, one can preload all files in an Alluxio directory to warm the cache for Presto queries, and set a TTL (time to live) for cached data to reclaim cache capacity.

The Alluxio Structured Data Service interacts with Presto through both a catalog and a caching file system built on option 1. It provides additional benefits on top of option 1: seamless access to existing Hive tables without modifying table locations in the Hive Metastore, and further performance optimization by consolidating many small files or transforming the format of input files.

Source: Presto Alluxio Cache Service
