
How can I optimise this LIKE JOIN query?

This query finds the suffix of a domain:

        SELECT
        DISTINCT ON ("companyDomain".id)
            "companyDomain".domain,
            "publicSuffix".suffix
        FROM
            "companyDomain"
        INNER JOIN
            "publicSuffix"
        ON
            REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
        ORDER BY
            "companyDomain".id, LENGTH("publicSuffix".suffix) DESC

Edit: Note that this also works with subdomains.

You can fiddle with the example here and visualize the plan with pev. I've tried adding covering indexes to the tables, but the query planner doesn't use them. Is there perhaps another query that would be more efficient?

Indexes offer no advantage for your data structure and query as written. Try to imagine how an index could be used here; I could not find a way.

My suggestion is to convert the domains and suffixes to arrays, like this:

alter table "companyDomain" add column adomain text[];
update "companyDomain" set adomain = string_to_array(domain, '.');
create index idx_adom on "companyDomain" using gin (adomain array_ops);

alter table "publicSuffix" add column asuffix text[];
update "publicSuffix" set asuffix = string_to_array(ltrim(suffix, '.'), '.');
create index idx_asuffix on "publicSuffix" using gin (asuffix array_ops);

Let's compare the queries:

postgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=185738.35..185940.72 rows=908 width=31) (actual time=2364.720..2364.890 rows=908 loops=1)                                        │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                           │
│   Buffers: shared hit=306                                                                                                                      │
│   ->  Sort  (cost=185738.35..185839.53 rows=40474 width=31) (actual time=2364.719..2364.764 rows=1006 loops=1)                                 │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                     │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                             │
│         Sort Method: quicksort  Memory: 103kB                                                                                                  │
│         Buffers: shared hit=306                                                                                                                │
│         ->  Nested Loop  (cost=0.00..182641.13 rows=40474 width=31) (actual time=22.735..2364.484 rows=1006 loops=1)                           │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)                 │
│               Join Filter: (reverse(("companyDomain".domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                  │
│               Rows Removed by Join Filter: 8093814                                                                                             │
│               Buffers: shared hit=306                                                                                                          │
│               ->  Seq Scan on public."publicSuffix"  (cost=0.00..377.15 rows=8915 width=12) (actual time=0.081..0.794 rows=8915 loops=1)       │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                        │
│                     Buffers: shared hit=288                                                                                                    │
│               ->  Materialize  (cost=0.00..31.62 rows=908 width=15) (actual time=0.001..0.036 rows=908 loops=8915)                             │
│                     Output: "companyDomain".domain, "companyDomain".id                                                                         │
│                     Buffers: shared hit=18                                                                                                     │
│                     ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=15) (actual time=11.576..11.799 rows=908 loops=1) │
│                           Output: "companyDomain".domain, "companyDomain".id                                                                   │
│                           Buffers: shared hit=18                                                                                               │
│ Planning Time: 0.167 ms                                                                                                                        │
│ JIT:                                                                                                                                           │
│   Functions: 9                                                                                                                                 │
│   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                │
│   Timing: Generation 1.956 ms, Inlining 0.000 ms, Optimization 0.507 ms, Emission 10.878 ms, Total 13.341 ms                                   │
│ Execution Time: 2366.971 ms                                                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The bottleneck here, as I understand it, is Rows Removed by Join Filter: 8093814

It seems that PostgreSQL builds the Cartesian product of the two tables and then filters it using the ON condition:

select count(*) from "companyDomain", "publicSuffix";
---
8094820
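
The numbers line up with a full cross product. As a quick arithmetic check, using only the row counts reported in the plan above (908 domains, 8915 suffixes, 1006 joined rows):

```sql
-- Row counts are taken from the EXPLAIN ANALYZE output above.
SELECT 908 * 8915        AS pairs_compared,   -- 8094820
       908 * 8915 - 1006 AS pairs_rejected;   -- 8093814
```

So every one of the roughly 8 million pairs is tested against the LIKE condition, and all but 1006 of them are discarded.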

As a workaround, try the array containment operator:

postgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON adomain @> asuffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                                  │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=8310.60..8512.97 rows=908 width=31) (actual time=180.149..180.335 rows=908 loops=1)                                           │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                        │
│   Buffers: shared hit=48986                                                                                                                 │
│   ->  Sort  (cost=8310.60..8411.78 rows=40474 width=31) (actual time=180.148..180.200 rows=1239 loops=1)                                    │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                  │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                          │
│         Sort Method: quicksort  Memory: 145kB                                                                                               │
│         Buffers: shared hit=48986                                                                                                           │
│         ->  Nested Loop  (cost=0.59..5213.39 rows=40474 width=31) (actual time=0.190..179.693 rows=1239 loops=1)                            │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)              │
│               Buffers: shared hit=48986                                                                                                     │
│               ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=57) (actual time=0.015..0.098 rows=908 loops=1)      │
│                     Output: "companyDomain".id, "companyDomain".domain, "companyDomain".created_at, "companyDomain".adomain                 │
│                     Buffers: shared hit=18                                                                                                  │
│               ->  Bitmap Heap Scan on public."publicSuffix"  (cost=0.59..5.15 rows=45 width=54) (actual time=0.052..0.197 rows=1 loops=908) │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                     │
│                     Recheck Cond: ("companyDomain".adomain @> "publicSuffix".asuffix)                                                       │
│                     Rows Removed by Index Recheck: 572                                                                                      │
│                     Heap Blocks: exact=41510                                                                                                │
│                     Buffers: shared hit=48968                                                                                               │
│                     ->  Bitmap Index Scan on idx_asuffix  (cost=0.00..0.58 rows=45 width=0) (actual time=0.039..0.039 rows=573 loops=908)   │
│                           Index Cond: ("publicSuffix".asuffix <@ "companyDomain".adomain)                                                   │
│                           Buffers: shared hit=7458                                                                                          │
│ Planning Time: 0.189 ms                                                                                                                     │
│ Execution Time: 180.434 ms                                                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

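This is not fully accurate: @> ignores the order and position of array elements, so aaa.bbb matches bbb.aaa here, and the join returns 1239 rows where the original returned 1006. You can fix that with an extra condition in the WHERE clause. In any case it will be faster.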
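
One possible way to do that fix, as a sketch: keep the array containment as the indexable join condition and re-apply the original suffix test in WHERE. The aliases cd and ps are only shorthand introduced here, and the query assumes the adomain/asuffix columns created above.

```sql
SELECT DISTINCT ON (cd.id)
    cd.domain,
    ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
    ON cd.adomain @> ps.asuffix            -- coarse, index-assisted filter
WHERE cd.domain LIKE '%' || ps.suffix      -- exact check: suffix must be at the end
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```

The LIKE now only runs on the pairs that survive the array filter rather than on the whole cross product.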

The old domain and suffix columns are now redundant, because you can restore them from adomain/asuffix using the array_to_string(anyarray, text [, text]) function.

As an alternative, to avoid changing the table structure, you can create functional indexes on string_to_array() and then use the same expression in the filters/joins.
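
A minimal sketch of that alternative follows. The index names are hypothetical, and whether the planner actually picks these indexes depends on your data and PostgreSQL version, so check the plan with EXPLAIN. The expressions in the query have to match the indexed expressions exactly.

```sql
-- Expression (functional) GIN indexes instead of extra array columns.
-- Index names are made up for this example.
CREATE INDEX idx_adom_expr ON "companyDomain"
    USING gin (string_to_array(domain, '.'));
CREATE INDEX idx_asuffix_expr ON "publicSuffix"
    USING gin (string_to_array(ltrim(suffix, '.'), '.'));

SELECT DISTINCT ON (cd.id)
    cd.domain,
    ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
    ON string_to_array(cd.domain, '.') @> string_to_array(ltrim(ps.suffix, '.'), '.')
   AND cd.domain LIKE '%' || ps.suffix     -- keeps only true end-of-string matches
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```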

Have you considered using a gin index?

I made the following modifications to your sample DDL:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
...
CREATE INDEX companyDomain_domain_reverse ON "companyDomain" USING gin (REVERSE(domain) gin_trgm_ops);
...
CREATE INDEX publicSuffix_suffix_reverse ON "publicSuffix" USING gin (REVERSE(suffix) gin_trgm_ops);

And here is the query plan:

+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40802.07..41004.44 rows=908 width=31) (actual time=98.229..98.356 rows=908 loops=1)                                                       |
|  ->  Sort  (cost=40802.07..40903.26 rows=40474 width=31) (actual time=98.228..98.264 rows=1006 loops=1)                                                |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                                      |
|        Sort Method: quicksort  Memory: 103kB                                                                                                           |
|        ->  Nested Loop  (cost=0.05..37704.86 rows=40474 width=31) (actual time=1.655..97.976 rows=1006 loops=1)                                        |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.011..0.728 rows=8915 loops=1)                       |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.15 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)                   |
|                    Recheck Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                    |
|                    Rows Removed by Index Recheck: 0                                                                                                    |
|                    Heap Blocks: exact=301                                                                                                              |
|                    ->  Bitmap Index Scan on companydomain_domain_reverse  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                |
|Planning Time: 0.150 ms                                                                                                                                 |
|Execution Time: 98.439 ms                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

As a bonus - you do not even need to REVERSE() the text in the index and in the query:

create index companydomain_domain
    on "companyDomain" using gin(domain gin_trgm_ops);



SELECT DISTINCT ON ("companyDomain".id) "companyDomain".domain, "publicSuffix".suffix
FROM "companyDomain"
         INNER JOIN "publicSuffix" ON "companyDomain".domain LIKE '%' || "publicSuffix".suffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC

The query takes the same amount of time and still uses the gin index:

+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40556.91..40759.28 rows=908 width=31) (actual time=96.170..96.315 rows=908 loops=1)                                               |
|  ->  Sort  (cost=40556.91..40658.10 rows=40474 width=31) (actual time=96.169..96.209 rows=1006 loops=1)                                        |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                              |
|        Sort Method: quicksort  Memory: 103kB                                                                                                   |
|        ->  Nested Loop  (cost=0.05..37459.70 rows=40474 width=31) (actual time=1.764..95.919 rows=1006 loops=1)                                |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.009..0.711 rows=8915 loops=1)               |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.12 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)           |
|                    Recheck Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                              |
|                    Rows Removed by Index Recheck: 0                                                                                            |
|                    Heap Blocks: exact=301                                                                                                      |
|                    ->  Bitmap Index Scan on companydomain_domain  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                          |
|Planning Time: 0.132 ms                                                                                                                         |
|Execution Time: 96.393 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+

PS: I guess you need only one of the indexes - in this case: companyDomain_domain_reverse

You want a match like

'something.google.com' like '%google.com'

But you know that PostgreSQL won't use an index for that, because the pattern string starts with a wildcard. So you reverse both strings:

'moc.elgoog.gnihtemos' like 'moc.elgoog%'

and create a function index on REVERSE("companyDomain".domain).

This is a very good idea, but PostgreSQL doesn't use your index. This is because the DBMS doesn't know what is in your strings (this is table data, and the DBMS won't read the whole table first just to arrive at a plan). In the worst case all reversed suffixes would start with '%'. If the DBMS decided to go through an index in that case, this could get extremely slow. You know that the suffixes don't end with '%', but the DBMS does not, so it settles on a safe plan (a full table scan).

This is documented here: https://www.postgresql.org/docs/9.2/indexes-types.html

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant ...

I see no way to convince PostgreSQL that it is safe to use the index. Adding AND REVERSE("publicSuffix".suffix) || '%' NOT LIKE '/%%' ESCAPE '/' to the join condition doesn't help, for instance.

In my opinion, your best bet is to use indexes on RIGHT(domain, 3) and RIGHT(suffix, 3), because we know suffixes including the dot to be at least three characters long. This can narrow the matches enough to be useful.
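
A rough way to see the constant-pattern restriction for yourself is to compare two EXPLAIN outputs. The sketch below assumes a hypothetical B-tree expression index named idx_cd_rev. On a table this small the planner may still prefer a sequential scan even for the constant pattern, so treat it as an experiment rather than a guaranteed result.

```sql
-- Hypothetical index name. text_pattern_ops lets LIKE prefix matches use
-- the B-tree regardless of the database collation.
CREATE INDEX idx_cd_rev ON "companyDomain" (REVERSE(domain) text_pattern_ops);

-- Constant pattern: the planner can turn the prefix into an index range.
EXPLAIN
SELECT domain
FROM "companyDomain"
WHERE REVERSE(domain) LIKE 'moc.elgoog%';

-- Pattern built from another table's column: it is not a constant at plan
-- time, so the planner does not derive an index range from it.
EXPLAIN
SELECT cd.domain, ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
    ON REVERSE(cd.domain) LIKE REVERSE(ps.suffix) || '%';
```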

CREATE INDEX idx_publicSuffix_suffix3 ON "publicSuffix"(RIGHT(suffix, 3) varchar_pattern_ops, suffix);

CREATE INDEX idx_companyDomain_domain3 ON "companyDomain"(RIGHT(domain, 3) varchar_pattern_ops, id, domain);

SELECT DISTINCT ON (cd.id)
  cd.domain,
  ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps ON cd.domain LIKE '%' || ps.suffix
                       AND RIGHT(cd.domain, 3) = RIGHT(ps.suffix, 3)
ORDER BY cd.id, LENGTH(ps.suffix) DESC;

Demo: https://www.db-fiddle.com/f/dPpVFWjpVJHYFnVut4k7wS/1

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦                                                                   QUERY PLAN                                                                                                     ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦ Unique  (cost=1684.72..1685.71 rows=198 width=72) (actual time=165.676..165.882 rows=908 loops=1)                                                                                ¦
¦     Buffers: shared hit=4079                                                                                                                                                     ¦
¦     ->  Sort  (cost=1684.72..1685.22 rows=198 width=72) (actual time=165.675..165.723 rows=1006 loops=1)                                                                         ¦
¦           Sort Key: cd.id, (length((ps.suffix)::text)) DESC                                                                                                                      ¦
¦           Sort Method: quicksort Memory: 103kB                                                                                                                                   ¦
¦           Buffers: shared hit=4079                                                                                                                                               ¦
¦           ->  Merge Join  (cost=0.56..1677.17 rows=198 width=72) (actual time=0.090..165.222 rows=1006 loops=1)                                                                  ¦
¦                 Buffers: shared hit=4076                                                                                                                                         ¦
¦                 ->  Index Only Scan using idx_companydomain_domain3 on companyDomain cd  (cost=0.28..93.23 rows=1130 width=36) (actual time=0.018..0.429 rows=908 loops=1)       ¦
¦                       Heap Fetches: 908                                                                                                                                          ¦
¦                       Buffers: shared hit=109                                                                                                                                    ¦
¦                 ->  Materialize  (cost=0.28..602.89 rows=7006 width=32) (actual time=0.019..47.510 rows=390620 loops=1)                                                          ¦
¦                       Buffers: shared hit=3967                                                                                                                                   ¦
¦                       ->  Index Only Scan using idx_publicsuffix_suffix3 on publicSuffix ps  (cost=0.28..585.37 rows=7006 width=32) (actual time=0.015..2.798 rows=8354 loops=1) ¦
¦                             Heap Fetches: 8354                                                                                                                                   ¦
¦                             Buffers: shared hit=3967                                                                                                                             ¦
¦ Planning time: 0.471 ms                                                                                                                                                          ¦
¦ Execution time: 166.054 ms                                                                                                                                                       ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

How about:

SELECT 
  DISTINCT ON ("companyDomain".id) "companyDomain".domain, 
  "publicSuffix".suffix 
FROM 
  "companyDomain" 
  INNER JOIN "publicSuffix" ON RIGHT(
    domain, 
    - POSITION('.' IN domain) + 1
  ) = "publicSuffix".suffix 
ORDER BY 
  "companyDomain".id, 
  LENGTH("publicSuffix".suffix) DESC;

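We get the position of the first . in the domain, then pass the negative of that value (+1 to keep the leading .) to RIGHT. With a negative length, RIGHT returns everything except that many leading characters, so the result is the part of the domain from the first dot onwards.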
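
To make the extraction concrete, here is the expression applied to a few sample values (the domains are made up for illustration and are not taken from the question's data):

```sql
SELECT d AS domain,
       RIGHT(d, -POSITION('.' IN d) + 1) AS extracted
FROM (VALUES ('google.com'),
             ('example.co.uk'),
             ('something.google.com')) AS t(d);
-- google.com            ->  .com
-- example.co.uk         ->  .co.uk
-- something.google.com  ->  .google.com
```

Note the last row: for a domain with a subdomain the extracted string is .google.com, which only joins if that exact string is stored in "publicSuffix". The equality join is therefore not a drop-in replacement for the LIKE query when subdomains are present.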

Looks like it runs much faster, from 2500ms to 120ms.

Live test

