What is the best way, algorithm, method to difference large lists of data?

Question

I receive a large list of current account numbers daily and store them in a database. My task is to find the accounts that were added and removed in each file. Right now I have 4 SQL tables (AccountsCurrent, AccountsNew, AccountsAdded, AccountsRemoved). When I receive a file, I load it entirely into AccountsNew, then run the queries below to find which accounts were added and removed.

INSERT AccountsAdded(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsCurrent)

INSERT AccountsRemoved(AccountNum, Name) SELECT AccountNum, Name FROM AccountsCurrent WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsNew)

TRUNCATE TABLE AccountsCurrent

INSERT AccountsCurrent(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew

TRUNCATE TABLE AccountsNew

Right now I am differencing about 250,000 accounts, but this is going to keep growing. Is this the best method, or do you have any other ideas?

EDIT: This is an MSSQL 2000 database. I'm using C# to process the file.

The only data I care about is the accounts that were added and removed between the last file and the current one. The AccountsCurrent table is only used to determine which accounts were added or removed.

Answer 1

Sounds like a history/audit process that might be better done using triggers. Have a separate history table that captures changes (e.g., timestamp, operation, who performed the change).

New and deleted accounts are easy to understand. "Current" accounts implies that there's an intermediate state between being new and deleted. I don't see any difference between "new" and "added".

I wouldn't have four tables. I'd have a STATUS table that would have the different possible states, and ACCOUNTS or the HISTORY table would have a foreign key to it.

Answer 2

To be honest, I think I'd follow something like your approach. One change you could make is to drop the truncate and copy: rename the "new" table to "current" and then re-create "new".

Answer 3

Using IN clauses on long lists can be slow.

If the tables are indexed, using a LEFT JOIN can prove to be faster...

INSERT INTO [table] (
    [fields]
    )
SELECT
    [fields]
FROM
    [table1]
LEFT JOIN
    [table2]
        ON [join condition]
WHERE
    [table2].[id] IS NULL

This assumes 1:1 relationships and not 1:many. If you have 1:many you can do any of...
1. SELECT DISTINCT
2. Use a GROUP BY clause
3. Use a different query, see below...

INSERT INTO [table] (
    [fields]
    )
SELECT
    [fields]
FROM
    [table1]
WHERE
    NOT EXISTS (SELECT * FROM [table2] WHERE [condition to match tables 1 and 2])

-- # This is quick provided that all fields to match the two tables are
-- # indexed in both tables.  Should then be much faster than the IN clause.
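Applied to the tables from the question, the LEFT JOIN form might look like the sketch below. It runs against an in-memory SQLite database through Python's sqlite3 module purely so that it is self-contained. That choice, the sample rows, and the helper name diff_accounts are assumptions for illustration; on MSSQL 2000 the same SELECT statements would still need checking against the real schema and indexes.

```python
import sqlite3


def diff_accounts(current_rows, new_rows):
    """Return (added, removed) rows using LEFT JOIN ... IS NULL anti-joins.

    diff_accounts is a hypothetical helper; table and column names follow
    the question. SQLite stands in for MSSQL here (an assumption).
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # PRIMARY KEY gives both join columns an index, as the answer requires.
    cur.execute("CREATE TABLE AccountsCurrent (AccountNum TEXT PRIMARY KEY, Name TEXT)")
    cur.execute("CREATE TABLE AccountsNew (AccountNum TEXT PRIMARY KEY, Name TEXT)")
    cur.executemany("INSERT INTO AccountsCurrent VALUES (?, ?)", current_rows)
    cur.executemany("INSERT INTO AccountsNew VALUES (?, ?)", new_rows)

    # Added: present in the new file, no matching row in the current table.
    added = cur.execute(
        """
        SELECT n.AccountNum, n.Name
        FROM AccountsNew AS n
        LEFT JOIN AccountsCurrent AS c ON c.AccountNum = n.AccountNum
        WHERE c.AccountNum IS NULL
        ORDER BY n.AccountNum
        """
    ).fetchall()

    # Removed: present in the current table, no matching row in the new file.
    removed = cur.execute(
        """
        SELECT c.AccountNum, c.Name
        FROM AccountsCurrent AS c
        LEFT JOIN AccountsNew AS n ON n.AccountNum = c.AccountNum
        WHERE n.AccountNum IS NULL
        ORDER BY c.AccountNum
        """
    ).fetchall()

    conn.close()
    return added, removed


# Made-up sample data, for illustration only.
current = [("1001", "Alice"), ("1002", "Bob"), ("1003", "Carol")]
new = [("1002", "Bob"), ("1003", "Carol"), ("1004", "Dana")]
added, removed = diff_accounts(current, new)
print("added:", added)
print("removed:", removed)
```

Because AccountNum is unique in both tables here, each unmatched row appears exactly once, so no DISTINCT or GROUP BY is needed.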

Answer 4

You could also subtract the intersection to get the differences in one table.
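One way to read the idea of subtracting the intersection is sketched below with in-memory Python sets: compute the accounts common to both lists once, then remove them from each side. The function name diff_by_intersection and the sample account numbers are assumptions for illustration, and the sketch presumes both lists fit in memory.

```python
def diff_by_intersection(current, new):
    """Return (added, removed) by subtracting the shared accounts.

    diff_by_intersection is a hypothetical helper, not part of any library.
    """
    current_set = set(current)
    new_set = set(new)
    # Accounts that appear in both the previous and the latest list.
    common = current_set & new_set
    # Whatever is left on each side after removing the overlap is the diff.
    added = new_set - common
    removed = current_set - common
    return added, removed


# Made-up sample data, for illustration only.
added, removed = diff_by_intersection(
    ["1001", "1002", "1003"],
    ["1002", "1003", "1004"],
)
print(sorted(added), sorted(removed))
```

Tagging each leftover row with its side ("added" or "removed") would let both results live in a single table, which may be what the answer has in mind.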

Answer 5

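If the initial files are ordered in a sensible and consistent way (big IF!), this would run considerably faster as a C# program that logically compares the files.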
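The file comparison described above can be sketched as a single merge-style pass over two sorted inputs. The sketch is in Python rather than C# to keep it short, and it assumes both inputs are sorted ascending by account number, contain no duplicates, and use the same comparison order; diff_sorted and the sample values are hypothetical.

```python
def diff_sorted(old_sorted, new_sorted):
    """Compare two ascending, duplicate-free sequences in one pass.

    Returns (added, removed). Assumes no element is None, because None
    marks the end of an input here. diff_sorted is a hypothetical helper.
    """
    added, removed = [], []
    old_iter, new_iter = iter(old_sorted), iter(new_sorted)
    old = next(old_iter, None)
    new = next(new_iter, None)

    while old is not None and new is not None:
        if old == new:
            # Present in both files: unchanged, advance both sides.
            old = next(old_iter, None)
            new = next(new_iter, None)
        elif old < new:
            # The old value was skipped over in the new file: removed.
            removed.append(old)
            old = next(old_iter, None)
        else:
            # The new value never appeared in the old file: added.
            added.append(new)
            new = next(new_iter, None)

    # Whatever remains on one side has no counterpart on the other.
    while old is not None:
        removed.append(old)
        old = next(old_iter, None)
    while new is not None:
        added.append(new)
        new = next(new_iter, None)

    return added, removed


# Made-up sample data, for illustration only.
added, removed = diff_sorted(
    ["1001", "1002", "1003"],
    ["1002", "1003", "1004"],
)
print(added, removed)
```

Each input is read once and only the current element of each side is held at a time, so the inputs can be streamed from files instead of being loaded whole.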

5 answers

solution1
1 2009-01-20 15:18:22

solution2
1 ACCEPTED 2009-01-20 15:30:37

solution3
1 2009-01-20 16:52:12

solution4
0 2009-01-20 17:21:58

solution5
0 2009-01-20 17:29:13
