I have a query where one table has ~10 million rows and the other two have fewer than 20 rows each.
SELECT a.name, b.name, c.total
FROM smallTable1 a, smallTable2 b, largeTable c
WHERE c.id1 = a.id AND c.id2 = b.id;
largeTable has columns (id, id1, id2, total) and ~10 million rows.
smallTable1 has columns (id, name).
smallTable2 has columns (id, name).
Right now it takes 5 seconds to run.
Is it possible to make it much faster?
Create indexes - they are the reason why querying is fast. Without indexes, the database is stuck doing full table scans.
So:
Important: you can create an index on more than one column at a time, like this: LargeTable(id1, id2) <--- DO NOT DO THAT, because it does not make sense in your case.
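As a concrete sketch of creating the two single-column indexes (using an in-memory SQLite database here purely for illustration; the `CREATE INDEX` syntax is essentially the same in most engines, and the index names are made up):

```python
import sqlite3

# In-memory database standing in for the real one (illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE smallTable1 (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE smallTable2 (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE largeTable (id INTEGER PRIMARY KEY, "
            "id1 INTEGER, id2 INTEGER, total REAL)")

# One single-column index per join column (index names are hypothetical).
cur.execute("CREATE INDEX ix_large_id1 ON largeTable (id1)")
cur.execute("CREATE INDEX ix_large_id2 ON largeTable (id2)")

# Confirm the indexes exist in the catalog.
indexes = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'index' AND tbl_name = 'largeTable'")]
print(sorted(indexes))  # ['ix_large_id1', 'ix_large_id2']
```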
Next, your query is not wrong out of the box, but it does not follow querying best practice. Relational databases are based on set theory, so you must think in terms of "bags of marbles" instead of "cells in a table". Roughly, your initial query builds the cross product of the three tables and then filters it with the WHERE clause.
Ambrish has suggested the correct query; use that, although it will not be faster. Why? Because in the end you still pull all the data in the table out of the database.
As for the data itself: 10 million records is not a ridiculously large table, but it is not small either. In data warehouses, the star schema is the standard, and what you have is basically a star schema. The problem you are actually facing is that the result has to be calculated on the fly, and that takes time.

The reason I'm telling you this is that engineers in corporate environments face this problem on a daily basis, and the solution is OLAP (basically pre-calculated, pre-aggregated, pre-summarized, pre-everything data). End users then query this pre-calculated data and the query seems very fast, but it is never 100% correct, because there is a delay between OLTP (on-line transaction processing = the day-to-day database) and OLAP (on-line analytical processing = the reporting database).

Indexes will help with queries such as WHERE id = 3. But when you are joining everything and basically pulling the whole table out of the database, they probably won't play a significant role in your case.
So to make a long story short: if queries are your only option, it will be hard to make an improvement.
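The OLAP idea above can be sketched as a pre-aggregated summary table that is rebuilt on a schedule instead of computed per query (SQLite sketch with toy data; the `summary` table name and the numbers are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE largeTable (id INTEGER PRIMARY KEY, "
            "id1 INTEGER, id2 INTEGER, total REAL)")
cur.executemany("INSERT INTO largeTable (id1, id2, total) VALUES (?, ?, ?)",
                [(1, 1, 10.0), (1, 2, 5.0), (1, 1, 2.5)])

# OLAP-style pre-aggregation: sum totals per (id1, id2) pair once,
# instead of recomputing the aggregate on every end-user query.
cur.execute("""
    CREATE TABLE summary AS
    SELECT id1, id2, SUM(total) AS total
    FROM largeTable
    GROUP BY id1, id2
""")

# End-user queries hit the small summary table, not the 10M-row fact table.
row = cur.execute(
    "SELECT total FROM summary WHERE id1 = 1 AND id2 = 1").fetchone()
print(row[0])  # 12.5
```

The trade-off is exactly the one described above: the summary is only as fresh as its last rebuild.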
There is one circumstance under which separately indexing ID1 and ID2 in the large table will make less of a difference. If there are 9,000,000 rows with ID1 matching SmallTable1.id and 200 rows with ID2 matching SmallTable2.id, with those 200 being the only rows where both match at the same time, you will still be doing almost a complete table/index scan. If that is the case, creating a single index on both ID1 and ID2 should speed things up, as the engine can then locate those 200 rows with index seeks.
If that works, you may want to include Total in that index to make it a covering index for that table.
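A minimal sketch of that covering index, again using SQLite for illustration (the index name is made up; SQLite's `EXPLAIN QUERY PLAN` is used here to confirm the query can be answered from the index alone):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE largeTable (id INTEGER PRIMARY KEY, "
            "id1 INTEGER, id2 INTEGER, total REAL)")

# Composite index on both join columns, with total included so a query
# touching only (id1, id2, total) never has to visit the base table.
cur.execute("CREATE INDEX ix_large_id1_id2_total "
            "ON largeTable (id1, id2, total)")

plan = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT total FROM largeTable WHERE id1 = 1 AND id2 = 2"
).fetchall()
# The plan detail should mention a COVERING INDEX.
print(plan[0][3])
```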
This solution (assuming it is one) would be extremely data-dependent, so the execution plan could change if the data changes significantly.
Whatever you decide to do, I would suggest you make one change (create an index, or whatever), then check the execution plan. Make another change and check the execution plan again. Repeat or roll back as needed.
Use JOIN syntax instead of the WHERE clause:
SELECT a.name, b.name, c.total
FROM smallTable1 a join largeTable c on c.id1 = a.id
join smallTable2 b on c.id2 = b.id;
And create indexes on largeTable(id1) and largeTable(id2).
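Put together, with toy data (SQLite sketch; table and column names are the ones from the question, the sample values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE smallTable1 (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE smallTable2 (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE largeTable (id INTEGER PRIMARY KEY, "
            "id1 INTEGER, id2 INTEGER, total REAL)")
cur.execute("INSERT INTO smallTable1 VALUES (1, 'alpha')")
cur.execute("INSERT INTO smallTable2 VALUES (2, 'beta')")
cur.execute("INSERT INTO largeTable (id1, id2, total) VALUES (1, 2, 99.0)")

# Single-column indexes on the join columns of the large table.
cur.execute("CREATE INDEX ix_large_id1 ON largeTable (id1)")
cur.execute("CREATE INDEX ix_large_id2 ON largeTable (id2)")

# Same query as the question, written with explicit JOIN ... ON syntax.
rows = cur.execute("""
    SELECT a.name, b.name, c.total
    FROM smallTable1 a
    JOIN largeTable c ON c.id1 = a.id
    JOIN smallTable2 b ON c.id2 = b.id
""").fetchall()
print(rows)  # [('alpha', 'beta', 99.0)]
```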