
DB2 SQL: fastest way to get lagged value of many columns

There are many ways to get a lagged value of a column in SQL, e.g.:

WITH CTE AS (
  SELECT
    ROW_NUMBER() OVER (ORDER BY columns_to_order_by) AS rownum,
    value
  FROM table
)
SELECT
  cur.value - prev.value
FROM CTE cur
INNER JOIN CTE prev ON prev.rownum = cur.rownum - 1

or:

select variable_of_interest 
               ,lag(variable_of_interest ,1) 
                    over(partition by
                    some_group order by variable_1,...,variable_n) 
                    as lag_variable_of_interest
from DATA

I use the second version, but my code runs very slowly when "lagging" many variables, at which point it becomes:

select        variable_of_interest_1
              ,variable_of_interest_2
              ,variable_of_interest_3
                   ,lag(variable_of_interest_1 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_1
                    ,lag(variable_of_interest_2 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_2
                   ,lag(variable_of_interest_3 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_3
    from DATA

Is this because each lag function must partition and order the whole data set on its own, even though they all use the same partition and order?

I am not 100% sure about how DB2 optimizes such queries. If it executes each lag independently, then there is definitely room to improve the optimizer.

One method you could use is lag() with a join on the primary key:

select t.*, tprev.*
from (select t.*, lag(id) over ( . . . ) as prev_id
      from t
     ) t left join
     t tprev
     on tprev.id = t.prev_id ;

From what you describe, this might be the most efficient method to do what you want.

This should be more efficient than row_number() because the join can make use of an index.
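The join approach above can be sketched more concretely. This is only an illustration under stated assumptions: the table data_with_id, its unique id column, and the g1/g2/o1/o2/v1..v3 column names are hypothetical and do not come from the question. Whether it actually beats several lag() calls would need to be checked against the real data and explain plan.

```sql
-- Hypothetical table; every name here is an illustrative assumption.
CREATE TABLE data_with_id (
    id INT NOT NULL PRIMARY KEY,  -- the primary key gives the join an index to use
    g1 INT, g2 INT,               -- partitioning columns
    o1 INT, o2 INT,               -- ordering columns
    v1 INT, v2 INT, v3 INT        -- columns whose previous values are wanted
);

-- Compute a single window function (the previous row's id),
-- then fetch all lagged columns through one join on the key.
SELECT cur.id,
       cur.v1, cur.v2, cur.v3,
       prev.v1 AS lag_v1,
       prev.v2 AS lag_v2,
       prev.v3 AS lag_v3
FROM (
    SELECT d.*,
           LAG(d.id) OVER (PARTITION BY d.g1, d.g2
                           ORDER BY d.o1, d.o2) AS prev_id
    FROM data_with_id d
) cur
LEFT JOIN data_with_id prev
       ON prev.id = cur.prev_id;  -- first row of each group has a NULL prev_id, so its lags are NULL
```

The LEFT JOIN keeps the first row of each partition, matching what lag() itself returns there (NULL).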

Db2 will only sort the data once if all OLAP functions use the same PARTITION BY and ORDER BY. You can confirm this by looking at an explain plan.

create table data(v1 int, v2 int, v3 int, g1 int, g2 int, o1 int, o2 int) organize by row
;
explain plan for
select  g1
,       g2
,       o1
,       o2
,       v1
,       v2
,       v3
,       lag(v1) over(partition by g1, g2 order by o1, o2 ) as lag_v1
,       lag(v2) over(partition by g1, g2 order by o1, o2 ) as lag_v2
,       lag(v3) over(partition by g1, g2 order by o1, o2 ) as lag_v3
from
    data
;

This gives the following plan (formatted with db2exfmt -1 -d $DATABASE). You can see there is only one SORT operator:

Access Plan:
-----------

    Total Cost:             14.839
    Query Degree:           4



      Rows 
     RETURN
     (   1)
      Cost 
       I/O 
       |
      1000 
     LMTQ  
     (   2)
     14.839 
        2 
       |
      1000 
     TBSCAN
     (   3)
     14.5555 
        2 
       |
      1000 
     SORT  
     (   4)
     14.5554 
        2 
       |
      1000 
     TBSCAN
     (   5)
     14.2588 
        2 
       |
      1000 
 TABLE: PAUL    
      DATA
       Q1
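For contrast, here is a sketch of the same check on a query whose window specifications differ. It reuses the data table from above. One would expect this plan to need more than one SORT, but that is an assumption to verify, not a measured result. The second statement counts operators for the most recent explain. It assumes the explain tables (EXPLAIN_OPERATOR in particular) exist in the current schema, which may not hold on every installation.

```sql
-- Two DIFFERENT window specifications: the optimizer cannot
-- satisfy both from a single ordering of the rows.
explain plan for
select  v1
,       v2
,       lag(v1) over(partition by g1 order by o1 ) as lag_v1
,       lag(v2) over(partition by g2 order by o2 ) as lag_v2
from
    data
;

-- Count operators by type for the most recently explained statement.
-- Assumption: explain tables live in the current schema.
select  operator_type
,       count(*) as operators
from
    explain_operator
where
    explain_time = (select max(explain_time) from explain_operator)
group by
    operator_type
;
```

Comparing the number of SORT rows for this query with the single SORT in the plan above shows whether the window specifications are being shared.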

BTW, if you post a question with a real SQL query (along with some DDL and some idea of the data volumes), we might be able to suggest ways to make fetching lagged values faster. It is difficult to advise in detail without a more concrete example.

