
How to improve or speed up a Postgres query with pg_trgm?

Are there any additional steps I can take to speed up query execution?

I have a DB with more than 100M rows and I need to do full-text search. For that I checked two options:

  1. Compare text with to_tsvector @@ (to_tsquery or plainto_tsquery).
  • This works very fast (under 1 s on all the data), but it has some problems with finding text similarity.
  2. Compare text with pg_trgm similarity.
  • This works fine for text comparison, but performs badly on large amounts of data (a sketch of both approaches follows this list).
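For reference, a minimal sketch of the two approaches (the table name items and the shortened search string are placeholders, not from my actual setup):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Option 1: full-text search. Fast with a GIN index on the tsvector,
-- but it matches whole lexemes, so near-matches and typos are missed.
SELECT name
FROM   items
WHERE  to_tsvector('simple', name) @@ plainto_tsquery('simple', 'ноутбук MSI GF63');

-- Option 2: trigram similarity. Tolerant of typos and partial matches,
-- but every candidate row's trigrams have to be compared.
SELECT name, similarity(name, 'ноутбук MSI GF63') AS sm
FROM   items
WHERE  name % 'ноутбук MSI GF63';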

I found that I can use indexes to improve query time.

Regarding siglen, I tried values from small numbers up to 2024, but for some reason Postgres uses 512 and not higher:

CREATE INDEX trgm_idx_512_gg ON table USING GIST (name gist_trgm_ops(siglen=512));

This is the EXPLAIN output:

"Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126)"
"  Recheck Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)"
"  ->  Bitmap Index Scan on trgm_idx_512_gg  (cost=0.00..1629.57 rows=9737 width=0)"
"        Index Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)"

And this is the query:

SELECT name, similarity(name, 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086') as sm
FROM table
WHERE name % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086' 

Execution time was about 120 seconds.

Question

How can I improve or speed up the query? Maybe I need to use a different approach, or just add something else?

EDIT

EXPLAIN (ANALYZE, BUFFERS)

I used a different name so that the search is completely new and not served from the cache:

"Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126) (actual time=159119.258..159960.251 rows=5645 loops=1)"
"  Recheck Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)"
"  Heap Blocks: exact=3795"
"  Buffers: shared read=1289378"
"  ->  Bitmap Index Scan on trgm_idx_512_gg  (cost=0.00..1629.57 rows=9737 width=0) (actual time=159118.616..159118.616 rows=5645 loops=1)"
"        Index Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)"
"        Buffers: shared read=1285583"
"Planning:"
"  Buffers: shared read=5"
"Planning Time: 4.063 ms"
"Execution Time: 159961.121 ms"

I also created a GIN index, but of all the indexes Postgres used the GiST one:

CREATE INDEX gin_gg ON table USING GIN (name gin_trgm_ops);

Your indexes are correct. Beyond that, here are some strategies, in order of importance for this specific case, to tune your query performance:

  • In your execution plan it can be noticed that buffers are being read instead of hit (meaning that data is missing from the Postgres buffer cache and has to be read from disk); see Using buffers for query optimization.

    Issue a SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers'; to see the size of your database buffers (that number must be multiplied by unit, usually 8kB, to get the actual size).

    The recommendation is 25% of the server's RAM for buffers. It is usually left at its default value, which is 16384 * 8kB = 128MB, but it should be adapted to the situation. In tests I have seen a slight improvement in query performance (not much when storage is on SSDs). You can change this parameter in the postgresql.conf file; a database restart is required.

  • Perform a vacuum verbose analyze <table>; to check whether there is data blocked by pending transactions (dead rows) and, at the same time, execute a vacuum analyze operation. The PostgreSQL query planner relies on statistical information about the contents of tables in order to generate good plans for queries. These statistics are gathered by the ANALYZE command, which can be invoked by itself or as an optional step in VACUUM. It is important to have reasonably accurate statistics; otherwise poor choices of plans might degrade database performance, see vacuuming.

    For many installations, it is sufficient to let vacuuming be performed by the autovacuum daemon, but some database administrators will want to supplement or replace the daemon's activities with manually-managed VACUUM commands, which are typically executed according to a schedule by cron or Task Scheduler scripts. This also updates the visibility map, which speeds up index-only scans (which doesn't fully apply in your case, because the data is mainly obtained by a heap scan).

  • You could execute a cluster <table> using <index>; (when the database is out of production, because it is a blocking operation) to reorganize the data for faster access; however, for this use case I don't see a performance improvement, see sql-cluster. A combined sketch of these commands follows this list.
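A sketch of the commands above (items is a placeholder table name; the index name is the one from the question):

-- Size of the buffer cache: multiply setting by unit (usually 8kB).
SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';

-- Report dead rows and refresh planner statistics in one pass.
VACUUM VERBOSE ANALYZE items;

-- Blocking operation: physically reorders the table along the index.
CLUSTER items USING trgm_idx_512_gg;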

A trigram GiST index with siglen=512 on 100M rows is very large and will probably never be cached efficiently. (The default is siglen=12, i.e. 12 bytes.) What makes you think this large signature would be a good choice?

I have had better experience with trigram GIN indexes, especially in current versions of Postgres. If the query planner is confused by the existence of the additional GiST index, you'll have to remove that one to test results with the GIN index. One way to test this without permanently losing the GiST index is sketched below.
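A sketch of such a test, run inside a transaction so the index drop can be rolled back (DROP INDEX takes an exclusive lock on the table until the transaction ends, so keep it short; items stands in for your table name):

BEGIN;
DROP INDEX trgm_idx_512_gg;   -- hide the GiST index from the planner
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, similarity(name, 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086') AS sm
FROM   items
WHERE  name % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086';
ROLLBACK;                     -- the GiST index is back as if nothing happened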

But first, to get a size comparison, look at the output of:

SELECT i.indexrelid::regclass::text AS idx
     , pg_get_indexdef(i.indexrelid) AS idx_def
     , pg_size_pretty(pg_relation_size(i.indexrelid)) AS idx_size
FROM   pg_class t
JOIN   pg_index i ON i.indrelid = t.oid
WHERE  t.relnamespace = 'public'::regnamespace
AND    t.relname = 'big'  -- your table name here
ORDER  BY 1;

(Ideally, add the result to the question.)

Your query plan shows vast amounts of Buffers: shared read for index and main relation (heap). So nothing was found in cache. The key to better performance will be to read fewer data pages to satisfy your queries, and more of them from cache: hit instead of read in the query plan.

Reducing the size of the table and indexes helps in this regard.

The selectivity of the trigram similarity operator % is set by the customizable option pg_trgm.similarity_threshold. The default of 0.3 is rather lax and allows many hits. A higher similarity threshold lets fewer (and better matching) rows through. What do you do with rows=5645 result rows anyway? Try:

SET pg_trgm.similarity_threshold = 0.5;  -- or higher

Then retry your query.
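For example, as one session (items again stands in for your table name; the ORDER BY is an addition, just to surface the best matches first):

SET pg_trgm.similarity_threshold = 0.5;

SELECT name
     , similarity(name, 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul') AS sm
FROM   items
WHERE  name % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'
ORDER  BY sm DESC;   -- best matches first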

The latest version of Postgres, a better server configuration, and more RAM can also help in this regard. You disclosed no information about any of these.
