简体   繁体   中英

Data skipping index for Map or pair-wise arrays in Clickhouse?

I am migrating a table from Postgres to Clickhouse, and one of the columns is a jsonb column which includes custom attributes. These attributes can be different per tenant, hence we currently have 100k different custom attributes' keys stored in postgres.

I checked Clickhouse's semi-structured JSON data options , and it seems we can use either Map(String, String) or 2 Array(String) columns holding the keys and values.

However I cannot make a proper assessment which one is best, as I get pretty similar results.

To test performance I created the following table:

CREATE TABLE maptest
 (
     `k` Int64,
     `keys` Array(String),
     `values` Array(String),
     `map` Map(String, String)
  )
 ENGINE = MergeTree
 ORDER BY k
 SETTINGS index_granularity = 8192;

insert into maptest 
select 
    number, 
    mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    map(concat('custom', toString(number%87000)), toString(number%87000)) 
from numbers(200000000);

--- data look like these:

SELECT *
FROM maptest
LIMIT 1

Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0

┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0']  │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘

However, whichever method I choose to query for a specific key-value pair, I always get the whole table scanned. eg

SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0

┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)

SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'


┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)



SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'


┌─count()─┐
│    2299 │
└─────────┘

1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)

Any suggestions on data skipping indexes for any of the 2 options?

You can add data skipping index for a map field, although you will need to set lower index_granularity to get to the most optimal values between index size and how many granules will be skipped. You should build your index using mapValues (or mapKeys , depending on your needs) map function:

CREATE TABLE maptest
 (
     `k` Int64,
     `keys` Array(String),
     `values` Array(String),
     `map` Map(String, String),
      INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
  )
 ENGINE = MergeTree
 ORDER BY k
 SETTINGS index_granularity = 2048; -- < lowered index_granularity!

insert into maptest 
select 
    number, 
    mapKeys(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    mapValues(map(concat('custom', toString(number%87000)), toString(number%87000))), 
    map(concat('custom', toString(number%87000)), toString(number%87000)) 
from numbers(20000000);

Now let's test it:

set send_logs_level='trace';

SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';

(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│     230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM