Postgres: multiple predicates on multiple columns

Question

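Edit:

I will explain what I am trying to do, in case someone knows a better way to write the query than what I am asking for.

I have one table with about 500 million rows and another with about 50 million rows.

The table definitions are as follows: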

CREATE TABLE NGRAM_CONTENT
(
    id  BIGINT NOT NULL PRIMARY KEY,
    ref TEXT   NOT NULL,
    data      TEXT
);

CREATE INDEX idx_reference_ngram_content ON NGRAM_CONTENT (ref);
CREATE INDEX idx_id_ngram_content ON NGRAM_CONTENT (id);


CREATE TABLE NGRAMS
(
    id  BIGINT NOT NULL,
    ngram   TEXT   NOT NULL,
    ref TEXT   NOT NULL,
    name_length INT NOT NULL
);

CREATE INDEX combined_index ON NGRAMS (name_length, ngram, ref, id);
CREATE INDEX namelength_idx ON NGRAMS (name_length);
CREATE INDEX id_idx ON NGRAMS (id);
CREATE INDEX ref_idx ON NGRAMS (ref);
CREATE INDEX ngram_idx ON NGRAMS (ngram);

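To allow fast bulk inserts, upstream events that have been marked as deleted are inserted with null in the data column of the NGRAM_CONTENT table, and no foreign key constraints are defined. Logically, though, both id and ref in the NGRAMS table are foreign keys to the NGRAM_CONTENT table.

Some sample data: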

Ngram_Content:

| id | ref  | data          |
| 1  | 'P1' | some_json     |
| 2  | 'P1' | some_new_json |  # P1 comes again as an update
| 3  | 'P2' | P3            |
| 4  | 'P1' | null          |

Ngrams:

| name_length | ngram | ref  | id |
| 12          | CH    | 'P1' | 1  |
| 12          | AN    | 'P1' | 1  |
| 14          | NEW   | 'P1' | 2  |
| 20          | CH    | 'P2' | 3  |
| 20          | CHAI  | 'P2' | 3  |
...

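For the data above, if I search for the ngrams 'CH' or 'AN' with id <= 1, it should return P1 with the content some_json. If I search with id <= 2, it should not match, because the latest P1 was updated to NEW at id = 2. If I search for NEW with id <= 5, it should also return nothing, because the latest P1 has been deleted.

All searches should be restricted to a name_length range (from and to).

In other words: find only the latest ngram content for a given ref, within the name_length limits, that has not been deleted, up to some id.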
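The rule just stated can be checked against the sample data with a small sketch. This is only an illustration of the intended semantics, not a tuned query for the full tables: the `_demo` table names, the name_length range 10 to 15, and the threshold of 1 are assumptions made up for the example, and `DISTINCT ON` is PostgreSQL-specific.

```sql
-- Hypothetical demo tables holding the sample rows from the question.
CREATE TEMP TABLE ngram_content_demo (
    id   BIGINT NOT NULL PRIMARY KEY,
    ref  TEXT   NOT NULL,
    data TEXT
);
CREATE TEMP TABLE ngrams_demo (
    id          BIGINT NOT NULL,
    ngram       TEXT   NOT NULL,
    ref         TEXT   NOT NULL,
    name_length INT    NOT NULL
);

INSERT INTO ngram_content_demo (id, ref, data) VALUES
    (1, 'P1', 'some_json'),
    (2, 'P1', 'some_new_json'),
    (3, 'P2', 'P3'),
    (4, 'P1', NULL);

INSERT INTO ngrams_demo (name_length, ngram, ref, id) VALUES
    (12, 'CH',   'P1', 1),
    (12, 'AN',   'P1', 1),
    (14, 'NEW',  'P1', 2),
    (20, 'CH',   'P2', 3),
    (20, 'CHAI', 'P2', 3);

-- Step 1: latest content row per ref, up to the event id cutoff.
-- Step 2: keep it only if it is not deleted and its own ngrams match.
WITH latest AS (
    SELECT DISTINCT ON (ref) id, ref, data
    FROM ngram_content_demo
    WHERE id <= 1                      -- event id cutoff (assumed value)
    ORDER BY ref, id DESC
)
SELECT l.*
FROM latest l
WHERE l.data IS NOT NULL               -- a null data column marks a deletion
  AND (SELECT count(*)
       FROM ngrams_demo n
       WHERE n.id = l.id
         AND n.ref = l.ref
         AND n.ngram IN ('CH', 'AN')
         AND n.name_length BETWEEN 10 AND 15   -- assumed range
      ) >= 1;                          -- assumed threshold
-- With the sample rows this returns the single row (1, 'P1', 'some_json').
```

Raising the cutoff to `id <= 2` makes the latest P1 row the one whose only ngram is NEW, so the search for 'CH' or 'AN' returns nothing. Searching for 'NEW' with `id <= 5` and a range that includes 14 also returns nothing, because the latest P1 row has null data.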

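I need to support two cases: 1. with an event id (for historical runs), and 2. without an event id, using the latest data.

So I came up with these two variants.

With event_id: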

select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
        select max(id) as max_event_id, ref from NGRAMS
        where name_length between a_number and b_number
          and ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* etc. */)
          and id < an_event_id
        group by ref having count(ref) >= a_threshold
    ) i on w.ref = i.ref
    where w.id >= i.max_event_id and w.id < an_event_id
    group by w.ref
) wi on w.ref = wi.ref and w.id = wi.w_max_event_id
where w.data is not null;

Without event_id:

select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
        select max(id) as max_event_id, ref from NGRAMS
        where name_length between a_number and b_number
          and ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* etc. */)
        group by ref having count(ref) >= a_threshold
    ) i on w.ref = i.ref
    where w.id >= i.max_event_id
    group by w.ref
) wi on w.ref = wi.ref and w.id = wi.w_max_event_id
where w.data is not null;

Both queries take a long time to run, and when I run EXPLAIN on them, Postgres shows a full (sequential) scan.

SEQ_SCAN (Seq Scan)  table: NGRAMS;  121494200   3358896.0   0.0 Node Type = Seq Scan;
Parent Relationship = Outer;
Parallel Aware = true;
Relation Name = NGRAMS;
Alias = w_1;
Startup Cost = 0.0;
Total Cost = 3358896.0;
Plan Rows = 121494200;
Plan Width = 16;

Detailed execution plan from EXPLAIN (ANALYZE, BUFFERS) on the query:

 Nested Loop  (cost=5032852.92..6943974.42 rows=1 width=381) (actual time=50787.356..52095.938 rows=9437 loops=1)
   Buffers: shared hit=149882 read=769965, temp read=732 written=736
   ->  Finalize GroupAggregate  (cost=5032852.35..5125447.71 rows=265783 width=16) (actual time=50785.079..50808.811 rows=9437 loops=1)
         Group Key: w_1.ref
         Buffers: shared hit=114072 read=758535, temp read=732 written=736
         ->  Gather Merge  (cost=5032852.35..5120132.05 rows=531566 width=16) (actual time=50785.072..50801.624 rows=10261 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
               ->  Partial GroupAggregate  (cost=5031852.33..5057776.12 rows=265783 width=16) (actual time=50766.172..50777.757 rows=3420 loops=3)
                     Group Key: w_1.ref
                     Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
                     ->  Sort  (cost=5031852.33..5039607.65 rows=3102128 width=16) (actual time=50766.163..50769.734 rows=41777 loops=3)
                           Sort Key: w_1.ref
                           Sort Method: quicksort  Memory: 3251kB
                           Worker 0:  Sort Method: quicksort  Memory: 3326kB
                           Worker 1:  Sort Method: quicksort  Memory: 3396kB
                           Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
                           ->  Hash Join  (cost=787482.50..4591332.06 rows=3102128 width=16) (actual time=14787.585..50749.022 rows=41777 loops=3)
                                 Hash Cond: (w_1.ref = i.ref)
                                 Join Filter: (w_1.id >= i.max_event_id)
                                 Buffers: shared hit=343708 read=2276169, temp read=2196 written=2208
                                 ->  Parallel Seq Scan on NGRAMS w_1  (cost=0.00..3662631.50 rows=53797008 width=16) (actual time=0.147..30898.313 rows=38518899 loops=3)
                                       Filter: (id < 45000000)
                                       Rows Removed by Filter: 58676466
                                       Buffers: shared hit=15819 read=2128135
                                 ->  Hash  (cost=786907.78..786907.78 rows=45978 width=16) (actual time=14767.179..14767.180 rows=9437 loops=3)
                                       Buckets: 65536  Batches: 1  Memory Usage: 955kB
                                       Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                       ->  Subquery Scan on i  (cost=782779.42..786907.78 rows=45978 width=16) (actual time=14669.187..14764.701 rows=9437 loops=3)
                                             Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                             ->  GroupAggregate  (cost=782779.42..786448.00 rows=45978 width=16) (actual time=14669.186..14763.369 rows=9437 loops=3)
                                                   Group Key: NGRAMS.ref
                                                   Filter: (count(NGRAMS.ref) >= 2)
                                                   Rows Removed by Filter: 210038
                                                   Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                                   ->  Sort  (cost=782779.42..783265.52 rows=194442 width=16) (actual time=14669.164..14708.948 rows=229489 loops=3)
                                                         Sort Key: NGRAMS.ref
                                                         Sort Method: external merge  Disk: 5856kB
                                                         Worker 0:  Sort Method: external merge  Disk: 5856kB
                                                         Worker 1:  Sort Method: external merge  Disk: 5856kB
                                                         Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
                                                         ->  Index Only Scan using combined_index on NGRAMS  (cost=0.57..762373.68 rows=194442 width=16) (actual time=0.336..14507.098 rows=229489 loops=3)
                                                               Index Cond: ((indexed = ANY ('{YU,CA,SAN,LT,TO}'::text[])) AND (name_length >= 15) AND (name_length <= 20) AND (event_id < 45000000))
                                                               Heap Fetches: 688467
                                                               Buffers: shared hit=327861 read=148034
   ->  Index Scan using idx_id_ngram_content on NGRAM_CONTENT w  (cost=0.56..6.82 rows=1 width=381) (actual time=0.135..0.136 rows=1 loops=9437)
         Index Cond: (id = (max(w_1.id)))
         Filter: ((data IS NOT NULL) AND (w_1.ref = ref))
         Buffers: shared hit=35810 read=11430
 Planning Time: 12.075 ms
 Execution Time: 52100.064 ms

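Is there a way to make these queries faster?

I tried splitting the query into smaller pieces and analyzing them, and found that the full scan happens in this join: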

select max(w.id) as w_max_event_id, w.ref from NGRAMS w
inner join (
    select max(id) as max_event_id, ref from NGRAMS
    where name_length between a_number and b_number
      and ngram in ('YU', 'CA', 'SAN', 'LT', 'TO' /* etc. */)
      and id < an_event_id
    group by ref having count(ref) >= a_threshold
) i on w.ref = i.ref
where w.id >= i.max_event_id and w.id < an_event_id
group by w.ref

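But I don't know why, and I'm not sure which indexes are missing.

An answer for Postgres is preferred, but failing that, please provide an answer for Oracle.

I know this is long, but please help if you can. Thanks.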

Answer 1

For queries this varied, your best bet is to create three indexes:

CREATE INDEX ON ngrams (id);
CREATE INDEX ON ngrams (name_length);
CREATE INDEX ON ngrams (ngram);

and hope that PostgreSQL can combine them with a bitmap AND if any one of the conditions is not selective enough on its own.
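Whether the planner actually combines the single-column indexes can be checked by running the innermost aggregate on its own under EXPLAIN. The sketch below reuses the literal values visible in the question's execution plan (name_length 15 to 20, id below 45000000, a threshold of 2); substitute your own. If the indexes are combined, the plan contains a BitmapAnd node above several Bitmap Index Scan nodes, feeding a Bitmap Heap Scan on ngrams.

```sql
-- Run the innermost subquery alone and inspect the chosen plan.
-- The literal values are copied from the plan in the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(id) AS max_event_id, ref
FROM ngrams
WHERE name_length BETWEEN 15 AND 20
  AND ngram IN ('YU', 'CA', 'SAN', 'LT', 'TO')
  AND id < 45000000
GROUP BY ref
HAVING count(ref) >= 2;
```

Note that ANALYZE executes the query, so on tables of this size the statement takes as long as the query itself. The planner may still prefer a single index or a sequential scan if it estimates that to be cheaper.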
