
Can't improve SQL join speed with indexes

I'm totally new to SQL and I am trying to speed up join queries on very large data. I started adding indexes (though, to be honest, I don't have a deep understanding of them). Not seeing much change, I decided to benchmark on a simpler, simulated example. I'm using the psql interface of PostgreSQL 11.5 on macOS 10.14.6. The database server is hosted locally on my computer. I apologize for any missing relevant information; this is my first time posting about SQL.

Database Structures

I created two initially identical databases, db and db_idx. I never put any index or key on the tables in db, while I try out indexes and keys on the tables in db_idx. I then run simple join queries in db and db_idx separately and compare the performance. Specifically, db_idx is made of two tables:

  • A client table with 100,000 rows and the following structure:
                   Table "public.client"
       Column    |  Type   | Collation | Nullable | Default
    -------------+---------+-----------+----------+---------
     client_id   | integer |           | not null |
     client_name | text    |           |          |
    Indexes:
        "pkey_c" PRIMARY KEY, btree (client_id)
  • A client_additional table with 70,000 rows and the following structure:
             Table "public.client_additional"
       Column   |  Type   | Collation | Nullable | Default
    ------------+---------+-----------+----------+---------
     client_id  | integer |           | not null |
     client_age | integer |           |          |
    Indexes:
        "pkey_ca" PRIMARY KEY, btree (client_id)
        "cov_idx" btree (client_id, client_age)

The client_id column in the client_additional table contains a subset of client's client_id values. Note the primary keys, and the other index I created on client_additional. I thought these would increase the benchmark query speed (see below), but they did not.

Importantly, the db database is exactly the same (same structure, same values) except that it has no indexes or keys.

Side note: the client and client_additional tables should perhaps be a single table, since they give information at exactly the same level (client level). However, the database I'm using in real life came structured this way: some tables are split into several tables by "topic" although they give information at the same level. I don't know if that matters for my issue.

Benchmark Query

I'm using the following query, which closely mimics what I need to do with real data:

    SELECT 
      client_additional.client_id, 
      client_additional.client_age,
      client.client_name
    FROM client
    INNER JOIN client_additional 
    ON client.client_id = client_additional.client_id;

Benchmark Results

On both databases, the benchmark query takes about 630 ms. Removing the keys and/or indexes in db_idx does not change anything. These benchmark results carry over to larger data sizes: speed is identical in the indexed and non-indexed cases.

That's where I am. How do I explain these results? Can I improve the join speed, and if so, how?

Use the EXPLAIN statement to see how the SQL engine intends to resolve the query. (Different SQL engines present this in different ways.) You can conclusively see whether the index will be used.

Also, you'll first need to load the tables with a lot of test data, because EXPLAIN tells you what the SQL engine intends to do right now, and this decision is based in part on the size of the table and various other statistics. If the table is virtually empty, the SQL engine might decide that the index wouldn't be helpful for now.

SQL engines use all kinds of very clever tricks to optimize performance, so it's actually rather difficult to get a useful timing test. But if EXPLAIN tells you that the index is being used, that's pretty much the answer that you're looking for.
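In PostgreSQL, the check described above might look like the following sketch. It reuses the table and column names from the question; the aliases c and ca are only shorthand introduced here.

```sql
-- Refresh planner statistics so estimates reflect the current table contents.
ANALYZE client;
ANALYZE client_additional;

-- EXPLAIN alone prints the chosen plan without running the query.
EXPLAIN
SELECT ca.client_id, ca.client_age, c.client_name
FROM client AS c
INNER JOIN client_additional AS ca ON c.client_id = ca.client_id;

-- EXPLAIN (ANALYZE, BUFFERS) actually runs the query and reports real
-- timings plus how many pages were served from cache or read from disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ca.client_id, ca.client_age, c.client_name
FROM client AS c
INNER JOIN client_additional AS ca ON c.client_id = ca.client_id;
```

Look for "Seq Scan" versus "Index Scan" or "Index Only Scan" nodes in the output to see whether an index was used.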

You have a primary key on each of the two tables, and these will be used for the joins. If you really want to see the queries slow down, remove the primary keys.
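A minimal sketch of that experiment, assuming it is run in db_idx and that the constraint names match the index names shown in the question (pkey_c, pkey_ca). Because DDL is transactional in PostgreSQL, the ROLLBACK restores the keys and the index afterwards. Note that the tables stay locked until the transaction ends, so this is only suitable for a local test database.

```sql
BEGIN;

-- Temporarily drop the primary keys and the secondary index.
ALTER TABLE client DROP CONSTRAINT pkey_c;
ALTER TABLE client_additional DROP CONSTRAINT pkey_ca;
DROP INDEX cov_idx;

-- Re-run the benchmark query and compare the plan and timing
-- with the indexed case.
EXPLAIN ANALYZE
SELECT ca.client_id, ca.client_age, c.client_name
FROM client AS c
INNER JOIN client_additional AS ca ON c.client_id = ca.client_id;

-- Undo all of the changes above.
ROLLBACK;
```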

What is happening? Well, my guess is that the execution plans are the same with or without the secondary indexes. You would need to look at the plans themselves.

Unlike many other databases, Postgres often gets less benefit from covering indexes, because row visibility information is stored in the data pages only. So the data pages still need to be accessed, unless the visibility map (maintained by VACUUM) marks a page as all-visible, in which case an index-only scan can skip it.
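One way to check how much the covering index helps in practice is sketched below. The range 1000 to 2000 is an arbitrary illustration, and which index the planner picks may differ on your system.

```sql
-- VACUUM updates the visibility map; ANALYZE refreshes the statistics.
VACUUM ANALYZE client_additional;

-- Both selected columns are contained in cov_idx, so the planner is
-- free to choose an "Index Only Scan" here if it estimates that to be cheapest.
EXPLAIN (ANALYZE, BUFFERS)
SELECT client_id, client_age
FROM client_additional
WHERE client_id BETWEEN 1000 AND 2000;
```

If an Index Only Scan node appears, its "Heap Fetches" line shows how many rows still required a visit to the data pages.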

Setting up a small test DB, adding some rows, and running your query:

CREATE TABLE client
(
   client_id integer PRIMARY KEY,
   client_name text
);

CREATE TABLE client_additional
(
   client_id integer PRIMARY KEY,
   client_age integer
);

INSERT INTO client (client_id, client_name) VALUES (generate_series(1,100000),'Phil');
INSERT INTO client_additional (client_id, client_age) VALUES (generate_series(1,70000),21);

ANALYZE;

EXPLAIN ANALYZE SELECT 
   client_additional.client_id, 
   client_additional.client_age,
   client.client_name
FROM
   client
INNER JOIN
   client_additional 
ON
   client.client_id = client_additional.client_id;

gave me this plan:

 Hash Join  (cost=1885.00..3590.51 rows=70000 width=11) (actual time=158.958..441.222 rows=70000 loops=1)
   Hash Cond: (client.client_id = client_additional.client_id)
   ->  Seq Scan on client  (cost=0.00..1443.00 rows=100000 width=7) (actual time=0.019..100.318 rows=100000 loops=1)
   ->  Hash  (cost=1010.00..1010.00 rows=70000 width=8) (actual time=158.785..158.786 rows=70000 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 3759kB
         ->  Seq Scan on client_additional  (cost=0.00..1010.00 rows=70000 width=8) (actual time=0.016..76.507 rows=70000 loops=1)
 Planning Time: 0.357 ms
 Execution Time: 506.739 ms

What you can see from this is that both tables were sequentially scanned, the rows of client_additional were loaded into a hash table, and a hash join probed that table with the rows of client. Postgres determined this was the optimal way to execute this query.

If you were to recreate the tables without the primary key (and therefore remove the implicit index on the PK column of each), you would get exactly the same plan. Postgres has determined that the quickest way to execute this query is to ignore the indexes, hash one table's values, and do a hash join against the other table to get the result.
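To compare the hash join against an alternative on the same data, one option is to disable hash joins for the current session only. This is a diagnostic sketch, not a tuning recommendation. With hash joins discouraged, the planner falls back to whatever it considers next best, which may be a merge join over the primary key indexes.

```sql
-- Discourage hash joins for this session only (diagnostic use).
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT ca.client_id, ca.client_age, c.client_name
FROM client AS c
INNER JOIN client_additional AS ca ON c.client_id = ca.client_id;

-- Restore the default planner behaviour.
RESET enable_hashjoin;
```

If the alternative plan is not faster than the hash join, that supports the planner's original choice.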

After changing the number of rows in the client table like so:

TRUNCATE Client;

INSERT INTO client (client_id, client_name) VALUES (generate_series(1,200000),'phil');

ANALYZE;

I then re-ran the same query and saw this plan instead:

 Merge Join  (cost=1.04..5388.45 rows=70000 width=13) (actual time=0.050..415.503 rows=70000 loops=1)
   Merge Cond: (client.client_id = client_additional.client_id)
   ->  Index Scan using client_pkey on client  (cost=0.42..6289.42 rows=200000 width=9) (actual time=0.022..86.897 rows=70001 loops=1)
   ->  Index Scan using client_additional_pkey on client_additional  (cost=0.29..2139.29 rows=70000 width=8) (actual time=0.016..86.818 rows=70000 loops=1)
 Planning Time: 0.517 ms
 Execution Time: 484.264 ms

Here you can see that index scans were done, as Postgres has determined that this plan is the better one given the current number of rows in the tables.

The point is that Postgres will use the indexes when it estimates they will produce a faster result, but the thresholds before they are used are somewhat higher than you may have expected.
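For contrast, indexes tend to matter most when a query touches only a small fraction of the rows. A sketch with a selective filter (the id 4242 is an arbitrary value chosen for illustration) would be expected to use the primary key indexes rather than scan both tables, though the exact plan depends on your data and settings:

```sql
-- Same join as the benchmark query, restricted to a single client.
EXPLAIN ANALYZE
SELECT ca.client_id, ca.client_age, c.client_name
FROM client AS c
INNER JOIN client_additional AS ca ON c.client_id = ca.client_id
WHERE c.client_id = 4242;  -- arbitrary example id
```

The benchmark query returns every matching row, so each table has to be read in full whichever join method is chosen, which is why the indexes make little difference there.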

All best,

Phil
