Increase SQL Query Performance (MAX date)

Question

I'm looking for a way to get the latest row (by col3) for each combination of col1 and col2.

Let's suppose we have the following table (all rows needed are marked with *):

col1                   col2                    col3  
---------------------------------------------------------
002478                 ABC                 2019-08-23    *
002478                 ABC                 2019-05-14    
002588                 CVMG                2019-01-07    *
002588                 IP                  2019-01-31    *
002588                 MMG                 2019-09-04    *
002588                 MMG                 2019-08-28    
002588                 NUSA                2019-11-04    *
002588                 NUSA                2019-04-24    
002746                 IE                  2019-01-15    *
003467                 IE                  2020-01-10    
003467                 IE                  2020-03-13    *

I was able to get the latest occurrences based on col1 and col2 with the following SELECT:

SELECT t.col1, 
       t.col2, 
       t.col3
FROM 
       teste t
WHERE t.col3 IN (SELECT max(a.col3) 
                 FROM teste a 
                 WHERE a.col1 = t.col1 AND a.col2 = t.col2)

In this example, it only takes about 7 to 10 ms to complete, but on my real database it takes about 1 hour.

I removed all the JOINs that I could, and the shortest time I've reached is about 55 minutes.

As I'm using Progress, there's no window function (that I'm aware of) like PARTITION BY.

Is there another way to solve this problem? This style of query is the only one I could think of.

Here's an SQL Fiddle with that example database.

Answer 1

Another way of writing the same query is to select the rows for which no newer related row exists:

SELECT t.col1, t.col2, t.col3
FROM teste t
WHERE NOT EXISTS
(
  SELECT NULL
  FROM teste t_newer
  WHERE t_newer.col1 = t.col1
    AND t_newer.col2 = t.col2
    AND t_newer.col3 > t.col3
);

This may be faster, slower, or equally fast, depending on how your DBMS executes it internally.
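A third formulation, offered as a sketch rather than a guaranteed improvement: compute the latest col3 per (col1, col2) once with GROUP BY, then join back to the table to fetch the matching rows. This assumes your Progress SQL version accepts a derived table in the FROM clause; the alias latest and the column name max_col3 are made up for illustration.

```sql
-- Latest col3 per (col1, col2), computed once, then joined back to the table.
SELECT t.col1, t.col2, t.col3
FROM teste t
JOIN (
  SELECT col1, col2, MAX(col3) AS max_col3
  FROM teste
  GROUP BY col1, col2
) latest
  ON  latest.col1 = t.col1
  AND latest.col2 = t.col2
  AND latest.max_col3 = t.col3;
```

If col1, col2 and col3 are the only columns you need, the inner SELECT on its own already returns one row per pair with its latest date, and the join back is unnecessary.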

With either of the two queries, the DBMS has to quickly look up other rows with the same col1 and col2. With only the table, the DBMS would have to read it sequentially again and again, once for every row. This is where indexes come into play: an index lets the DBMS look up where in the table the matching rows are.

In your case you want an index on col1 and col2, so the DBMS has a means to find the related rows. You can also add col3, as this must be compared, too. The order of col1 and col2 in the index may or may not matter. How many distinct col1 values are in the table, and how many distinct col2 values? If one has just 5 distinct values and the other 5,000, start the index with the one with 5,000 values, because each value then matches fewer rows, i.e. you get to the rows you are interested in faster.
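To answer the "how many different values" question, you could count them directly. This is a minimal sketch in standard SQL; the aliases distinct_col1 and distinct_col2 are made up, and on a large table without an index each statement may itself take a while.

```sql
-- Number of distinct values per column.
-- The column with the larger count is the more selective one.
SELECT COUNT(DISTINCT col1) AS distinct_col1 FROM teste;
SELECT COUNT(DISTINCT col2) AS distinct_col2 FROM teste;
```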

An index could then look like

create index idx on teste (col1, col2, col3);
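If col2 turns out to have far more distinct values than col1, the reasoning above suggests leading with col2 instead. A sketch of that variant follows; the name idx_col2_first is made up for illustration.

```sql
-- Same three columns, but starting with col2 (only if col2 is the more selective column).
create index idx_col2_first on teste (col2, col1, col3);
```

One of the two indexes should be enough; which one serves the queries better depends on your data.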

The queries stay the same. The DBMS will look at your query and decide whether to use an index or not. For the given queries I am sure it will use the index mentioned, because the queries are all about quickly looking up related rows.
