简体   繁体   中英

Choose right sort key on Amazon Redshift

I'm creating a table on Amazon Redshift for storing a huge amount of data per day.

I tried to optimize the most I could the database performance using sort keys.

The idea is to be able to serve a wep application with these data through a API performing select queries on it.

After multiple different sort keys, I'm not convinced at all that I'm using the right one. I must be missing / misunderstanding something ...

Table definition :

CREATE TABLE test_table(
  date date NOT NULL,
  country char(2) NOT NULL,
  application_id integer NOT NULL,
  device smallint NOT NULL,
  category smallint NOT NULL,
  subcategory smallint DEFAULT NULL,
  rank smallint DEFAULT NULL,
  subrank smallint DEFAULT NULL,
  is_free smallint NOT NULL,
  downloads integer DEFAULT NULL)
  distkey(application_id)

Data context :

  1. Store between 10 000 000 and 20 000 000 lines/ day
  2. Keep 2 years of history

Sort keys I've already tried :

  1. Compound sortkey(device, is_free, date, country, category, subcategory)
  2. Interleaved sortkey(device, is_free, country, category, subcategory)

Performance test executed (on 18 million generated rows) :

With any of these sort keys, example queries below are always performing during 3s and 7s depending number of given countries / categories and date range.

Query examples :

SELECT country, category, sum(downloads)
FROM test_table
WHERE country IN ('fr','jp', 'de', 'gb', 'us')
AND category in (6014, 6005, 6011, 6004, 6017)
AND device = 0
AND is_free = 1
AND date >= '2019-01-01' AND date <= '2019-04-01'
GROUP BY country, category;
SELECT category, country, rank, avg(downloads)
FROM test_table
WHERE country IN ('br','co', 'ru')
AND category in (6009, 6008, 6000)
AND device = 1
AND is_free = 0
AND rank IN (1, 10, 100)
AND date BETWEEN '2019-03-01' AND '2019-04-01'
GROUP BY category, country, rank;
SELECT category, country, application_id, sum(downloads)
FROM test_table
WHERE country IN ('us', 'cn', 'jp')
AND category in (6010, 6003, 6002)
AND device = 1
AND is_free = 1
AND date BETWEEN '2019-04-01' AND '2019-06-01'
GROUP BY category, country, application_id

Is it possible to make it faster ? Are the selected sort keys bad ? Can I put the date field in the interleaved sort key ? (even if I've read this is a bad idea)

If you think Redshift is not made for this case, do you have other database suggestions (I have no restriction on the technology) ?

Thank you in advance for your help :)

Redshift is absolutely the right choice for this type of query IMO. See my examples below where I get response time of just a few hundred milliseconds on a tiny cluster.

The date or timestamp column should generally be the first column in your compound sort key. Add other columns in descending order of the number of unique values.

Avoid using INTERLEAVED sort keys for tables that you regularly add data to.

Here's an example using the store_sales table from TPC-DS at 100GB scale: 265 million rows. I converted the ss_sold_date_sk and ss_sold_date_sk surrogate keys to a real timestamp.

--   column    | distinct val
-- ss_hdemo_sk |       7,200
-- ss_promo_sk |       1,000
-- ss_store_sk |         201
-- ss_quantity |         100

CREATE TABLE IF NOT EXISTS "store_sales_ts" (…)
DISTSTYLE KEY
DISTKEY ("ss_item_sk")
SORTKEY ("ss_sold_ts"
        ,"ss_hdemo_sk"
        ,"ss_promo_sk"
        ,"ss_store_sk"
        ,"ss_quantity")
;

Timings are taken on a 2 node dc2.large cluster. Result cache is disabled as shown.

SET enable_result_cache_for_session TO off
;
SELECT ss_store_sk
     , COUNT(*)         AS sales_count
     , AVG(ss_quantity) AS avg_quantity
FROM store_sales_ts
WHERE ss_sold_ts BETWEEN '2001-09-01' AND '2001-09-30'
AND ss_store_sk IN (356,241,160,70)
GROUP BY 1
;
--First run: 5415.869 ms 
--Second run: 1485.217 ms
--Third run: 173.262 ms
--Change month: 337.084 ms

SELECT ss_quantity
     , COUNT(*)         AS sales_count
     , AVG(ss_ext_discount_amt) AS avg_discount_amt
FROM store_sales_ts
WHERE ss_sold_ts BETWEEN '2001-09-01' AND '2001-09-30'
AND ss_quantity > 90
GROUP BY 1
;
--First run: 5717.890 ms
--Second run: 206.465 ms
--Change year: 210.091 ms

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM