
Postgres table with 100k records, 2 Inner Joins Extremely Slow

I'm migrating my data from SQL Server to Postgres.

I'm changing my table structure to handle generic sports matches, but it is giving me performance problems.

I have the following tables:

  • matches (id, start_time)
  • match_teams (id, match_id, team_id, score)
  • match_players (id, lineup_id, player_id), where lineup_id is a foreign key referencing match_teams.id (see the sketch below)
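
For reference, a minimal DDL sketch of this structure (the column types here are assumptions; the actual DDL is linked in the update below):

CREATE TABLE matches (
    id         serial PRIMARY KEY,
    start_time timestamp
);

CREATE TABLE match_teams (
    id       serial PRIMARY KEY,
    match_id integer REFERENCES matches (id),
    team_id  integer,
    score    integer
);

CREATE TABLE match_players (
    id        serial PRIMARY KEY,
    lineup_id integer REFERENCES match_teams (id),
    player_id integer
);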

I'm selecting all matches with the following query:

SELECT * FROM matches AS m
INNER JOIN match_teams AS t ON m.id = t.match_id
INNER JOIN match_players AS p ON t.id = p.lineup_id

With 100k records, this query takes around 6 minutes:

-- Executing query:
SELECT * FROM matches AS m
INNER JOIN match_teams AS t ON m.id = t.match_id
INNER JOIN match_players AS p ON t.id = p.lineup_id
Total query runtime: 336360 ms.
1142078 rows retrieved.

On SQL Server, I had all of this data in one table, and it would return in less than 5 seconds. In Postgres, I also put this data into a single table using jsonb and was able to run the above query in 40 seconds.

How can I make this query faster? I would like to get it down to seconds.

Reading online, I found that creating indexes can speed up these joins, so I created the following:

CREATE INDEX match_teams_match_id_idx ON match_teams USING btree (match_id);
CREATE INDEX match_players_lineup_id_idx ON match_players USING btree (lineup_id);
CREATE INDEX match_players_player_id_idx ON match_players USING btree (player_id);
CREATE INDEX matches_id_idx ON matches USING btree (id);

These indexes haven't made the query faster at all. Am I missing one?

Here's the EXPLAIN ANALYSE VERBOSE output of the above query:

"Hash Join  (cost=19314.10..67893.04 rows=1135917 width=24) (actual time=401.225..1624.906 rows=1142078 loops=1)"
"  Output: m.id, m.start_time, t.team_id, t.rank, p.player_id"
"  Hash Cond: (p.lineup_id = t.id)"
"  ->  Seq Scan on public.match_players p  (cost=0.00..19818.78 rows=1142078 width=8) (actual time=0.039..356.168 rows=1142078 loops=1)"
"        Output: p.player_id, p.lineup_id"
"  ->  Hash  (cost=15119.58..15119.58 rows=228442 width=24) (actual time=401.123..401.123 rows=228442 loops=1)"
"        Output: m.id, m.start_time, t.team_id, t.rank, t.id"
"        Buckets: 8192  Batches: 4  Memory Usage: 3358kB"
"        ->  Hash Join  (cost=5097.97..15119.58 rows=228442 width=24) (actual time=74.766..310.864 rows=228442 loops=1)"
"              Output: m.id, m.start_time, t.team_id, t.rank, t.id"
"              Hash Cond: (t.match_id = m.id)"
"              ->  Seq Scan on public.match_teams t  (cost=0.00..3519.42 rows=228442 width=16) (actual time=0.004..64.580 rows=228442 loops=1)"
"                    Output: t.team_id, t.rank, t.match_id, t.id"
"              ->  Hash  (cost=3112.21..3112.21 rows=114221 width=12) (actual time=74.728..74.728 rows=114221 loops=1)"
"                    Output: m.id, m.start_time"
"                    Buckets: 16384  Batches: 2  Memory Usage: 2682kB"
"                    ->  Seq Scan on public.matches m  (cost=0.00..3112.21 rows=114221 width=12) (actual time=0.003..34.789 rows=114221 loops=1)"
"                          Output: m.id, m.start_time"
"Planning time: 0.448 ms"
"Execution time: 1799.412 ms"

Update

Added DDL Here: http://pastie.org/10529040

Update 2

Postgres is running on an AWS RDS server. I tried running the above query from a clean EC2 instance with a fresh PGAdmin install and got the same results: the query appears to execute in ~2 seconds, but it takes ~6 minutes to display the data.

Update 3

I tried running this query from a simple C# program and the results were returned in around 10 seconds. This appears to be an issue with PGAdmin.

Postgres has a very smart query engine. I use Postgres often, and I have frequently heard from others that "Postgres is slow", yet I have never experienced this. It simply may not ship with the same defaults as other DBMSs, so you just need to understand how to optimize it.

A couple of stable points:

  • Every table should have a primary key, declared with a PRIMARY KEY constraint
  • Any large table that you JOIN, ORDER BY, or GROUP BY on will benefit from a btree index on the relevant columns (as you stated above)

Because the engine "thinks" for itself, you often need to tell Postgres to re-analyze a table once you put an index on it. You also sometimes need to VACUUM it to clear out any "dead" rows that Postgres thinks it has.

To do this, do:

VACUUM [schema].[table_name];
ANALYZE [schema].[table_name];

Do this AFTER you have put your indexes on it, and it should greatly speed up your query execution.
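
For example, on the tables in this question (assuming they live in the public schema, as the EXPLAIN output suggests), you can combine both steps with VACUUM ANALYZE:

VACUUM ANALYZE public.matches;
VACUUM ANALYZE public.match_teams;
VACUUM ANALYZE public.match_players;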

You do NOT need to put an index on a PRIMARY KEY, as the PRIMARY KEY CONSTRAINT creates an automatic unique INDEX.

As long as you have a PRIMARY KEY constraint on match_teams (id), you should only need to put a btree index on match_teams, like so:

CREATE INDEX match_teams_match_id_idx ON match_teams USING btree (match_id);

As @a_horse_with_no_name and @David Aldridge pointed out, the query ran in ~2 seconds on the server but took ~6min to actually display the results in PGAdmin.

I tried running the same query from an AWS server (different network) with the same results.

I then tried running this query locally from a simple C# program and the results were returned in around 10 seconds.

This appears to be an issue with PGAdmin.
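
One way to confirm this kind of client-side bottleneck (a sketch, not from the original post) is to run the same join wrapped in an aggregate, so that almost no data is sent to the client:

-- Server does the full join, but returns a single row
SELECT count(*) FROM matches AS m
INNER JOIN match_teams AS t ON m.id = t.match_id
INNER JOIN match_players AS p ON t.id = p.lineup_id;

If this returns in a couple of seconds while the SELECT * version takes minutes, the time is being spent transferring and rendering the ~1.1M result rows, not executing the join.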
