Postgres multiple predicates on multiple columns
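Edit:
I will explain what I am trying to do, so that someone may know a better way to write the query than what I am asking for.
I have one table with about 500 million rows and another with about 50 million rows.
The table definitions are as follows: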
CREATE TABLE NGRAM_CONTENT
(
id BIGINT NOT NULL PRIMARY KEY,
ref TEXT NOT NULL,
data TEXT
);
CREATE INDEX idx_reference_ngram_content ON NGRAM_CONTENT (ref);
CREATE INDEX idx_id_ngram_content ON NGRAM_CONTENT (id);
CREATE TABLE NGRAMS
(
id BIGINT NOT NULL,
ngram TEXT NOT NULL,
ref TEXT NOT NULL,
name_length INT NOT NULL
);
CREATE INDEX combined_index ON NGRAMS (name_length, ngram, ref, id);
CREATE INDEX namelength_idx ON NGRAMS (name_length);
CREATE INDEX id_idx ON NGRAMS (id);
CREATE INDEX ref_idx ON NGRAMS (ref);
CREATE INDEX ngram_idx ON NGRAMS (ngram);
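To allow fast bulk inserts, upstream events that have been marked as deleted are inserted with null in the data column of the table, and no foreign key constraints are set. Logically, though, id and ref in the NGRAMS table are both foreign keys into the NGRAM_CONTENT table.
Some sample data: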
Ngram_Content:
|id | ref | data |
| 1 | 'P1' | some_json |
| 2 | 'P1' | some_new_json | # P1 comes again as an update
| 3 | 'P2' | P3 |
| 4 | 'P1' | null |
Ngrams:
| name_length | ngram | ref  | id |
| 12          | CH    | 'P1' | 1  |
| 12          | AN    | 'P1' | 1  |
| 14          | NEW   | 'P1' | 2  |
| 20          | CH    | 'P2' | 3  |
| 20          | CHAI  | 'P2' | 3  |
...
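For the data above, if I search for the ngram 'CH' or 'AN' with id <= 1, it returns P1 with the content some_json. If I search with id <= 2, nothing matches, because the latest version of P1 at id=2 has been updated to NEW. If I search for NEW with id <= 5, it also returns nothing, because the latest P1 has been deleted.
All searches should be restricted to a from/to range on name_length.
In other words: for a given ref, find only the latest ngram content, within the name_length bounds, that has not been deleted as of some id.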
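The rule described above can be illustrated on the sample rows. This is only a sketch of the intended semantics, not a tuned replacement for the queries in this post. The scratch tables demo_content and demo_ngrams are hypothetical names, the name_length range 10 to 20 and the threshold of 1 are arbitrary illustration values, and it assumes that "the latest version of a ref" means the row with the highest id in NGRAM_CONTENT at or below the cutoff.

```sql
-- Hypothetical scratch tables holding the sample rows from the question.
CREATE TEMP TABLE demo_content (
    id   BIGINT NOT NULL PRIMARY KEY,
    ref  TEXT   NOT NULL,
    data TEXT
);
CREATE TEMP TABLE demo_ngrams (
    id          BIGINT NOT NULL,
    ngram       TEXT   NOT NULL,
    ref         TEXT   NOT NULL,
    name_length INT    NOT NULL
);

INSERT INTO demo_content (id, ref, data) VALUES
    (1, 'P1', 'some_json'),
    (2, 'P1', 'some_new_json'),
    (3, 'P2', 'P3'),
    (4, 'P1', NULL);            -- NULL data marks P1 as deleted

INSERT INTO demo_ngrams (name_length, ngram, ref, id) VALUES
    (12, 'CH',   'P1', 1),
    (12, 'AN',   'P1', 1),
    (14, 'NEW',  'P1', 2),
    (20, 'CH',   'P2', 3),
    (20, 'CHAI', 'P2', 3);

-- Step 1: pick the newest content row per ref at or below the cutoff id.
-- Step 2: keep it only if it is not deleted and its own ngrams match.
WITH latest AS (
    SELECT DISTINCT ON (ref) id, ref, data
    FROM demo_content
    WHERE id <= 1                        -- cutoff id (assumed value)
    ORDER BY ref, id DESC
)
SELECT l.id, l.ref, l.data
FROM latest l
JOIN demo_ngrams n
  ON n.id = l.id AND n.ref = l.ref
WHERE l.data IS NOT NULL
  AND n.ngram IN ('CH', 'AN')
  AND n.name_length BETWEEN 10 AND 20    -- assumed range
GROUP BY l.id, l.ref, l.data
HAVING count(*) >= 1;                    -- assumed threshold
```

With the cutoff at 1 this selects P1 with some_json. Raising the cutoff to 2 makes id 2 the latest version of P1, whose only ngram is NEW, so nothing matches. Searching for NEW with the cutoff at 5 also yields nothing, because the latest P1 row (id 4) has null data.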
I need to support 2 cases: 1. with an event id (for historical runs), and 2. without an event id, using the latest data.
So I came up with these 2 variations.
With event_id:
select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
        select max(id) as max_event_id, ref from NGRAMS
        where name_length between a_number and b_number
          and ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc)
          and id < an_event_id
        group by ref
        having count(ref) >= a_threshold) i
    on w.ref = i.ref
    where w.id >= i.max_event_id and w.id < an_event_id
    group by w.ref) wi
on w.ref = wi.ref and w.id = wi.w_max_event_id
where w.data is not null;
Without event_id:
select w.* from NGRAM_CONTENT w
inner join (
    select max(w.id) as w_max_event_id, w.ref from NGRAMS w
    inner join (
        select max(id) as max_event_id, ref from NGRAMS
        where name_length between a_number and b_number
          and ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc)
        group by ref
        having count(ref) >= a_threshold) i
    on w.ref = i.ref
    where w.id >= i.max_event_id
    group by w.ref) wi
on w.ref = wi.ref and w.id = wi.w_max_event_id
where w.data is not null;
Both queries take a long time to run, and when I run EXPLAIN on them, Postgres shows a full (sequential) scan.
SEQ_SCAN (Seq Scan) table: NGAMS; 121494200 3358896.0 0.0 Node Type = Seq Scan;
Parent Relationship = Outer;
Parallel Aware = true;
Relation Name = NGRAMS;
Alias = w_1;
Startup Cost = 0.0;
Total Cost = 3358896.0;
Plan Rows = 121494200;
Plan Width = 16;
Detailed execution plan from EXPLAIN (ANALYZE, BUFFERS) on the query:
Nested Loop (cost=5032852.92..6943974.42 rows=1 width=381) (actual time=50787.356..52095.938 rows=9437 loops=1)
Buffers: shared hit=149882 read=769965, temp read=732 written=736
-> Finalize GroupAggregate (cost=5032852.35..5125447.71 rows=265783 width=16) (actual time=50785.079..50808.811 rows=9437 loops=1)
Group Key: w_1.ref
Buffers: shared hit=114072 read=758535, temp read=732 written=736
-> Gather Merge (cost=5032852.35..5120132.05 rows=531566 width=16) (actual time=50785.072..50801.624 rows=10261 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Partial GroupAggregate (cost=5031852.33..5057776.12 rows=265783 width=16) (actual time=50766.172..50777.757 rows=3420 loops=3)
Group Key: w_1.ref
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Sort (cost=5031852.33..5039607.65 rows=3102128 width=16) (actual time=50766.163..50769.734 rows=41777 loops=3)
Sort Key: w_1.ref
Sort Method: quicksort Memory: 3251kB
Worker 0: Sort Method: quicksort Memory: 3326kB
Worker 1: Sort Method: quicksort Memory: 3396kB
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Hash Join (cost=787482.50..4591332.06 rows=3102128 width=16) (actual time=14787.585..50749.022 rows=41777 loops=3)
Hash Cond: (w_1.ref = i.ref)
Join Filter: (w_1.id >= i.max_event_id)
Buffers: shared hit=343708 read=2276169, temp read=2196 written=2208
-> Parallel Seq Scan on NGRAMS w_1 (cost=0.00..3662631.50 rows=53797008 width=16) (actual time=0.147..30898.313 rows=38518899 loops=3)
Filter: (id < 45000000)
Rows Removed by Filter: 58676466
Buffers: shared hit=15819 read=2128135
-> Hash (cost=786907.78..786907.78 rows=45978 width=16) (actual time=14767.179..14767.180 rows=9437 loops=3)
Buckets: 65536 Batches: 1 Memory Usage: 955kB
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Subquery Scan on i (cost=782779.42..786907.78 rows=45978 width=16) (actual time=14669.187..14764.701 rows=9437 loops=3)
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> GroupAggregate (cost=782779.42..786448.00 rows=45978 width=16) (actual time=14669.186..14763.369 rows=9437 loops=3)
Group Key: NGRAMS.ref
Filter: (count(NGRAMS.ref) >= 2)
Rows Removed by Filter: 210038
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Sort (cost=782779.42..783265.52 rows=194442 width=16) (actual time=14669.164..14708.948 rows=229489 loops=3)
Sort Key: NGRAMS.ref
Sort Method: external merge Disk: 5856kB
Worker 0: Sort Method: external merge Disk: 5856kB
Worker 1: Sort Method: external merge Disk: 5856kB
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Index Only Scan using combined_index on NGRAMS (cost=0.57..762373.68 rows=194442 width=16) (actual time=0.336..14507.098 rows=229489 loops=3)
Index Cond: ((indexed = ANY ('{YU,CA,SAN,LT,TO}'::text[])) AND (name_length >= 15) AND (name_length <= 20) AND (event_id < 45000000))
Heap Fetches: 688467
Buffers: shared hit=327861 read=148034
-> Index Scan using idx_id_ngram_content on NGRAM_CONTENT w (cost=0.56..6.82 rows=1 width=381) (actual time=0.135..0.136 rows=1 loops=9437)
Index Cond: (id = (max(w_1.id)))
Filter: ((data IS NOT NULL) AND (w_1.ref = ref))
Buffers: shared hit=35810 read=11430
Planning Time: 12.075 ms
Execution Time: 52100.064 ms
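Is there any way to make these queries faster?
I tried splitting the query into smaller pieces and analyzing them, and found that the full scan happens in this join: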
select max(w.id) as w_max_event_id, w.ref from NGRAMS w
inner join (
    select max(id) as max_event_id, ref from NGRAMS
    where name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc) AND id < an_event_id
    group by ref having count(ref) >= a_threshold) i
on w.ref = i.ref
where w.id >= i.max_event_id AND w.id < an_event_id
group by w.ref
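But I don't know why, and I am not sure which indexes are missing.
An answer for Postgres would be best, but failing that, an answer for Oracle is welcome too.
I know this is long, but please try to help if you can. Thanks.
For queries this varied, the best approach is to create three indexes: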
CREATE INDEX ON ngrams (id);
CREATE INDEX ON ngrams (name_length);
CREATE INDEX ON ngrams (ngram);
and hope that PostgreSQL can combine them with a BitmapAnd if one of the conditions alone is not selective enough.
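One way to check whether the planner actually combines the single-column indexes is to run EXPLAIN on the inner aggregate from the question. The sketch below reuses the literal values visible in the question's plan output (name_length 15 to 20, id below 45000000, threshold 2). They are there only to make the statement concrete and are not recommendations.

```sql
-- Note: EXPLAIN with ANALYZE really executes the statement,
-- so on a table of this size expect it to take a while.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(id) AS max_event_id, ref
FROM ngrams
WHERE name_length BETWEEN 15 AND 20
  AND ngram IN ('YU', 'CA', 'SAN', 'LT', 'TO')
  AND id < 45000000
GROUP BY ref
HAVING count(ref) >= 2;
```

If the indexes are combined, the plan would typically contain a BitmapAnd node over several Bitmap Index Scan nodes, feeding a Bitmap Heap Scan. Whether the planner chooses this depends on the table statistics and on how selective each condition is, so the result may differ from one data set to another.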