finding consecutive date pairs in SQL

I have a question that resembles some I found by searching, but those came with solutions for slightly different problems and, importantly, solutions that don't work in SQL Server 2000.

I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested from a date between the effective date of that entry and the next non-duplicate entry.

The data looks something like this:

id     user_id effective_date important_value useless_value
1      1       1/3/2007       3               0
2      1       1/4/2007       3               1
3      1       1/6/2007       NULL            1
4      1       2/1/2007       3               0
5      2       1/5/2007       12              1
6      3       1/1/1899       7               0

With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would delete only the row with id = 2, preserving the information from 1/3/2007, showing that the important_value changed on 1/6/2007, and then showing the relevant change again on 2/1/2007.
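The duplicate rule above can be sketched outside the database to make the expected result concrete. This is only an illustration in Python, not part of the original script: the function name `redundant_ids` is hypothetical, the dates are assumed to be month/day/year and are rewritten as ISO strings so that string order equals date order, and two NULL values are treated as equal, as the question describes for UDF_SameOrNull.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the table above: (id, user_id, effective_date, important_value).
# Assumption: the original dates are month/day/year; ISO strings sort the same way.
ROWS = [
    (1, 1, "2007-01-03", 3),
    (2, 1, "2007-01-04", 3),
    (3, 1, "2007-01-06", None),
    (4, 1, "2007-02-01", 3),
    (5, 2, "2007-01-05", 12),
    (6, 3, "1899-01-01", 7),
]


def redundant_ids(rows):
    """Return ids of rows whose important_value repeats the previous row's
    value for the same user (None == None counts as a match)."""
    to_delete = []
    ordered = sorted(rows, key=itemgetter(1, 2))  # by user_id, then date
    for _, group in groupby(ordered, key=itemgetter(1)):
        previous = None
        for row in group:
            if previous is not None and previous[3] == row[3]:
                to_delete.append(row[0])
            previous = row
    return to_delete


print(redundant_ids(ROWS))  # prints [2]
```

Only id 2 is flagged: row 3 differs from row 2 (NULL versus 3), and row 4 differs from row 3 (3 versus NULL).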

My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces), and creates a temp table of just the rows for that user. Then to get consecutive entries, it takes the temp table, joins it to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.

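DECLARE @UserId int

--cursor over the distinct user_id values
DECLARE UserCursor CURSOR FOR
  SELECT DISTINCT user_id FROM History

OPEN UserCursor
FETCH NEXT FROM UserCursor INTO @UserId

WHILE (@@fetch_status <> -1)
BEGIN
  --copy this user's rows into a temp table
  SELECT * INTO #history FROM History WHERE user_id = @UserId

  --return entries to delete: the later row of each consecutive matching pair
  SELECT h2.id
  INTO #delete_history_ids
  FROM #history h1
  JOIN #history h2 ON
    h1.effective_date < h2.effective_date
    AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
  WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)

  DELETE h1
  FROM History h1
  JOIN #delete_history_ids dh ON
    h1.id = dh.id

  --drop the temp tables so the next iteration can recreate them
  DROP TABLE #delete_history_ids
  DROP TABLE #history

  FETCH NEXT FROM UserCursor INTO @UserId
END

CLOSE UserCursor
DEALLOCATE UserCursor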

It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.

Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.

Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.

OK, I think I figured this one out, excellent question!

First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case, so let me know if we need to account for that.

The process takes the table of values and self-joins it on equal user_id and important_value, with tprev being the row with the earlier effective_date. Then one more self-join on user_id checks whether the two joined records are consecutive, by verifying that no record has an effective_date between them.

It's just a select statement for now; it should select all records that are to be deleted. Once you verify that it returns the correct data, simply change the select * to delete tcheck.

Let me know if you have questions.

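select 
    * 
from 
    History tcheck
    -- tprev: an earlier row for the same user with the same important_value
    inner join History tprev
        on  tprev.[user_id] = tcheck.[user_id]
            and (tprev.important_value = tcheck.important_value
                 -- "=" never matches NULL, so treat two NULLs as equal explicitly
                 or (tprev.important_value is null and tcheck.important_value is null))
            and tprev.effective_date < tcheck.effective_date
    -- checkbtwn: any row for the same user dated strictly between the two
    left join History checkbtwn
        on  tcheck.[user_id] = checkbtwn.[user_id]
            and checkbtwn.effective_date < tcheck.effective_date
            and checkbtwn.effective_date > tprev.effective_date
where
    -- keep only pairs with nothing in between, i.e. consecutive rows
    checkbtwn.[user_id] is null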
OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck in SQL Server 2000.

I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column.

--create table with identity column
CREATE TABLE #history (
  id int, 
  user_id int, 
  effective_date datetime, 
  important_value int, 
  useless_value int,
  idx int IDENTITY(1,1)
)

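--insert rows ordered by effective_date and now indexed in order
--(@user_id is the user currently being processed)
INSERT INTO #history (id, user_id, effective_date, important_value, useless_value)
SELECT id, user_id, effective_date, important_value, useless_value
FROM History 
WHERE user_id = @user_id
ORDER BY effective_date

--get pairs where consecutive values match (two NULLs count as a match)
SELECT * 
FROM #history h1
JOIN #history h2 ON
  h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value
  OR (h1.important_value IS NULL AND h2.important_value IS NULL)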

With this approach, I still have to iterate over the results until it returns nothing, but I can't think of any way around that and this approach is miles ahead of my last one.
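On the iteration: if each row is compared with its immediate predecessor in the original data and every flagged row is deleted, one pass appears to be enough, because a run of equal values flags every row except the first, and the survivors are then the first rows of neighboring runs, which differ by definition. The sketch below checks that on the sample rows plus a hypothetical user 4 with runs of repeated values. It runs on Python's built-in sqlite3 module only so it is self-contained (SQLite standing in for SQL Server 2000), uses a correlated MAX() subquery instead of ROW_NUMBER() or an identity column, and assumes effective_date is unique per user_id. The names `FIND_REDUNDANT` and `flagged_ids` are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE History (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "effective_date TEXT, important_value INTEGER, useless_value INTEGER)"
)
conn.executemany(
    "INSERT INTO History VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2007-01-03", 3, 0),
        (2, 1, "2007-01-04", 3, 1),
        (3, 1, "2007-01-06", None, 1),
        (4, 1, "2007-02-01", 3, 0),
        (5, 2, "2007-01-05", 12, 1),
        (6, 3, "1899-01-01", 7, 0),
        # Hypothetical user 4: a run of three 5s, then two NULLs, then 5 again.
        (7, 4, "2007-03-01", 5, 0),
        (8, 4, "2007-03-02", 5, 0),
        (9, 4, "2007-03-03", 5, 1),
        (10, 4, "2007-03-04", None, 1),
        (11, 4, "2007-03-05", None, 0),
        (12, 4, "2007-03-06", 5, 0),
    ],
)

# A row is redundant when the latest earlier row for the same user
# carries the same important_value (two NULLs count as the same).
FIND_REDUNDANT = """
SELECT cur.id
FROM History cur
JOIN History prev ON prev.user_id = cur.user_id
WHERE prev.effective_date = (
        SELECT MAX(p.effective_date)
        FROM History p
        WHERE p.user_id = cur.user_id
          AND p.effective_date < cur.effective_date)
  AND (prev.important_value = cur.important_value
       OR (prev.important_value IS NULL AND cur.important_value IS NULL))
"""


def flagged_ids(connection):
    """Ids the query currently flags as redundant, in ascending order."""
    return sorted(row[0] for row in connection.execute(FIND_REDUNDANT))


first_pass = flagged_ids(conn)
conn.execute("DELETE FROM History WHERE id IN (" + FIND_REDUNDANT + ")")
second_pass = flagged_ids(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM History ORDER BY id")]

print(first_pass)   # prints [2, 8, 9, 11]
print(second_pass)  # prints []
```

If rows were instead deleted one pair at a time, or compared against something other than the immediate predecessor, repeated passes could still be needed, so this is worth verifying against real data before dropping the loop.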
