
What is the best way, algorithm, method to difference large lists of data?

I receive a large list of current account numbers daily and store them in a database. My task is to find the accounts that were added and removed in each file. Right now I have 4 SQL tables (AccountsCurrent, AccountsNew, AccountsAdded, AccountsRemoved). When I receive a file, I load it entirely into AccountsNew, then run the queries below to find which accounts were added and removed.

INSERT AccountsAdded(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsCurrent)

INSERT AccountsRemoved(AccountNum, Name) SELECT AccountNum, Name FROM AccountsCurrent WHERE AccountNum NOT IN (SELECT AccountNum FROM AccountsNew)

TRUNCATE TABLE AccountsCurrent

INSERT AccountsCurrent(AccountNum, Name) SELECT AccountNum, Name FROM AccountsNew

TRUNCATE TABLE AccountsNew

Right now I am differencing about 250,000 accounts, but this number is going to keep growing. Is this the best method? Do you have any other ideas?

EDIT: This is an MSSQL 2000 database. I'm using C# to process the file.

The only data I am focused on is the accounts that were added and removed between the last and current files. AccountsCurrent is only used to determine which accounts were added or removed.

Sounds like a history/audit process that might be better done using triggers. Have a separate history table that captures changes (e.g., timestamp, operation, who performed the change, etc.).

New and deleted accounts are easy to understand. "Current" accounts implies that there's an intermediate state between being new and deleted. I don't see any difference between "new" and "added".

I wouldn't have four tables. I'd have a STATUS table that holds the different possible states, and the ACCOUNTS or HISTORY table would have a foreign key to it.

To be honest, I think that I'd follow something like your approach. One thing is that you could remove the truncate, rename "new" to "current", and re-create "new".
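The rename idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MSSQL 2000, the helper name `promote_new_to_current` and the column types are hypothetical, and the old "current" table is dropped first so the name is free. On SQL Server the rename step would go through `sp_rename` rather than `ALTER TABLE ... RENAME TO`.

```python
import sqlite3

# Assumed schema for illustration; the question only names the two columns.
ACCOUNTS_DDL = "(AccountNum TEXT PRIMARY KEY, Name TEXT)"

def promote_new_to_current(conn):
    """Swap tables by renaming instead of TRUNCATE + INSERT ... SELECT."""
    conn.executescript(f"""
        DROP TABLE AccountsCurrent;
        ALTER TABLE AccountsNew RENAME TO AccountsCurrent;
        CREATE TABLE AccountsNew {ACCOUNTS_DDL};
    """)
```

The swap avoids copying every row from one table to the other, which is the part of the original process that grows with the size of the file.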

Using IN clauses on long lists can be slow.

If the tables are indexed, using a LEFT JOIN can prove to be faster...

INSERT INTO [table] (
    [fields]
    )
SELECT
    [fields]
FROM
    [table1]
LEFT JOIN
    [table2]
        ON [join condition]
WHERE
    [table2].[id] IS NULL
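
Filled in for the tables in the question, the template above might look like the sketch below. It runs against an in-memory SQLite database through Python's `sqlite3` module purely so that it is self-contained; the function name `find_added_and_removed` and the column types are assumptions, and the SELECT statements are the part meant to carry over to MSSQL.

```python
import sqlite3

def find_added_and_removed(current, new):
    """Return (added, removed) account numbers via LEFT JOIN ... IS NULL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE AccountsCurrent (AccountNum TEXT PRIMARY KEY, Name TEXT);
        CREATE TABLE AccountsNew     (AccountNum TEXT PRIMARY KEY, Name TEXT);
    """)
    conn.executemany("INSERT INTO AccountsCurrent VALUES (?, ?)", current)
    conn.executemany("INSERT INTO AccountsNew VALUES (?, ?)", new)

    # Rows in the new file that have no partner in the current table.
    added = conn.execute("""
        SELECT n.AccountNum
        FROM AccountsNew AS n
        LEFT JOIN AccountsCurrent AS c ON c.AccountNum = n.AccountNum
        WHERE c.AccountNum IS NULL
        ORDER BY n.AccountNum
    """).fetchall()

    # Same join with the roles swapped: rows that disappeared from the file.
    removed = conn.execute("""
        SELECT c.AccountNum
        FROM AccountsCurrent AS c
        LEFT JOIN AccountsNew AS n ON n.AccountNum = c.AccountNum
        WHERE n.AccountNum IS NULL
        ORDER BY c.AccountNum
    """).fetchall()

    conn.close()
    return [row[0] for row in added], [row[0] for row in removed]
```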

The LEFT JOIN form assumes a 1:1 relationship, not 1:many. If you have 1:many you can do any of...
1. SELECT DISTINCT
2. Use a GROUP BY clause
3. Use a different query, see below...

INSERT INTO [table] (
    [fields]
    )
SELECT
    [fields]
FROM
    [table1]
WHERE
    NOT EXISTS (SELECT * FROM [table2] WHERE [condition to match tables 1 and 2])

-- This is quick provided that all fields used to match the two tables are
-- indexed in both tables. It should then be much faster than the IN clause.
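
For the added/removed problem in the question, the rows wanted are the ones with no match in the other table, so the sketch below uses NOT EXISTS. It is a minimal illustration, again on in-memory SQLite via Python's `sqlite3`; the helper names `make_db` and `record_differences`, the index names, and the column types are all assumptions.

```python
import sqlite3

def make_db(current, new):
    """Build the four tables from the question, with indexes on the match column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE AccountsCurrent (AccountNum TEXT, Name TEXT);
        CREATE TABLE AccountsNew     (AccountNum TEXT, Name TEXT);
        CREATE TABLE AccountsAdded   (AccountNum TEXT, Name TEXT);
        CREATE TABLE AccountsRemoved (AccountNum TEXT, Name TEXT);
        CREATE INDEX ix_current_num ON AccountsCurrent (AccountNum);
        CREATE INDEX ix_new_num     ON AccountsNew (AccountNum);
    """)
    conn.executemany("INSERT INTO AccountsCurrent VALUES (?, ?)", current)
    conn.executemany("INSERT INTO AccountsNew VALUES (?, ?)", new)
    return conn

def record_differences(conn):
    """Fill AccountsAdded and AccountsRemoved using NOT EXISTS."""
    conn.execute("""
        INSERT INTO AccountsAdded (AccountNum, Name)
        SELECT n.AccountNum, n.Name
        FROM AccountsNew AS n
        WHERE NOT EXISTS (SELECT 1 FROM AccountsCurrent AS c
                          WHERE c.AccountNum = n.AccountNum)
    """)
    conn.execute("""
        INSERT INTO AccountsRemoved (AccountNum, Name)
        SELECT c.AccountNum, c.Name
        FROM AccountsCurrent AS c
        WHERE NOT EXISTS (SELECT 1 FROM AccountsNew AS n
                          WHERE n.AccountNum = c.AccountNum)
    """)
    conn.commit()
```

Because the subquery only tests whether a match exists, repeated rows in the table being probed do not multiply the output the way a plain join would.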

You can also subtract the intersection to get the differences in one table.
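
One way to read this suggestion, sketched with in-memory Python sets rather than SQL: compute the accounts common to both lists, then remove them from each side. The function name `diff_by_intersection` is hypothetical, and the sketch assumes both lists of account numbers fit in memory.

```python
def diff_by_intersection(current, new):
    """Return (added, removed) by subtracting the shared accounts from each side."""
    current, new = set(current), set(new)
    common = current & new      # accounts present in both files
    added = new - common        # only in the new file
    removed = current - common  # only in the previous file
    return added, removed
```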

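If the initial files are ordered in a sensible and consistent way (big IF!), this would run considerably faster as a C# program that logically compares the files.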
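
The file comparison could be a single merge pass over the two sorted lists. The sketch below is written in Python rather than C# to keep it short, and it assumes each input is sorted ascending by account number with no duplicates; the name `diff_sorted` is hypothetical.

```python
def diff_sorted(previous, current):
    """Merge-compare two sorted, duplicate-free sequences of account numbers.

    Returns (added, removed): items only in `current`, items only in `previous`.
    """
    added, removed = [], []
    prev_iter, curr_iter = iter(previous), iter(current)
    p = next(prev_iter, None)
    c = next(curr_iter, None)

    while p is not None and c is not None:
        if p == c:          # present in both files: unchanged
            p = next(prev_iter, None)
            c = next(curr_iter, None)
        elif p < c:         # only in the previous file: removed
            removed.append(p)
            p = next(prev_iter, None)
        else:               # only in the current file: added
            added.append(c)
            c = next(curr_iter, None)

    # Whatever is left on either side has no counterpart.
    while p is not None:
        removed.append(p)
        p = next(prev_iter, None)
    while c is not None:
        added.append(c)
        c = next(curr_iter, None)
    return added, removed
```

Each account number is read once, so the work grows linearly with the size of the two files and neither file has to be held in memory in full if the inputs are streamed.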

