
Python sqlite3: Diff of table in two databases

I have two databases with identical schemas and I want to effectively do a diff on one of the tables, i.e. return only the unique records, ignoring the primary key.

# Column names are the second field of each PRAGMA table_info row (works on Python 2 and 3)
columns = [row[1] for row in db1.execute("PRAGMA table_info(foo)")]
db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
data = db2.execute("""
    SELECT 
        one.* 
    FROM 
        db1.foo AS one 
        JOIN db2.foo 
        AS two 
    WHERE {}
    """.format(' AND '.join( ['one.{0}!=two.{0}'.format(c) for c in columns[1:]]))
).fetchall()

That is, ignoring the primary key (in this case meow), don't return the records that exist identically in both databases.

The table foo in db1 looks like:

meow    mix    please   deliver
1       123    abc
2       234    bcd      two
3       345    cde

And the table foo in db2 looks like:

meow    mix    please   deliver
1       345    cde
2       123    abc      one
3       234    bcd      two     
4       456    def      four

So the unique entries from db2 are:

[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]

which is what I get. This works great if I have more than two columns. But if there are only two, i.e. a primary key and a value, such as in a lookup table:

bar  baz         bar   baz
1    123         1     234
2    234         2     345
3    345         3     123
                 4     456

I get all non-unique values repeated N-1 times and unique values repeated N times, where N is the number of records in db1. I understand why this is happening but I don't know how to fix it.

[(1, '234'),
 (1, '234'),
 (2, '345'),
 (2, '345'),
 (3, '123'),
 (3, '123'),
 (4, '456'),
 (4, '456'),
 (4, '456')]

One idea I had was to just take the modulus after pulling all the duplicate results:

import itertools

N = db1.execute("SELECT Count(*) FROM foo").fetchone()[0]
# Keep only rows whose repeat count is a multiple of N (the unique ones)
data = [
    list(row)
    for row, group in itertools.groupby(sorted(data))
    if len(list(group)) % N == 0
]

Which does work:

[[4, '456']]

But this seems messy and I'd like to do it all in that first SQL query if possible.

Also, on large tables (my real db has ~10k records) this takes a long time. Is there any way to optimize this? Thanks!

Replacing my earlier answer: here is a good general solution.

Having input tables that look like this:

sqlite> select * from t1;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           123         abc                   
2           234         bcd         two       
3           345         cde                   

and

sqlite> select * from t2;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           345         cde                   
2           123         abc         one       
3           234         bcd         two       
4           456         def         four      

You can get records that are in t2 but not in t1 (ignoring PKs) like this:

select sum(q1.db), mix, please, delivery
from (select 1 as db, mix, please, delivery from t1
      union all
      select 2 as db, mix, please, delivery from t2) q1
group by mix, please, delivery
having sum(db) = 2;

sum(q1.db)  mix         please      delivery  
----------  ----------  ----------  ----------
2           123         abc         one       
2           456         def         four      
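
For anyone wanting to run this from Python, here is a minimal sketch of the same query using the sqlite3 module. It assumes both tables live in one in-memory connection, that the first column is the primary key, and that the blank delivery values are empty strings; the helper names make_demo_connection and rows_only_in_t2 are made up for illustration.

```python
import sqlite3

def make_demo_connection():
    """Build an in-memory database holding the t1 and t2 sample tables."""
    conn = sqlite3.connect(":memory:")
    for name in ("t1", "t2"):
        conn.execute(
            f"CREATE TABLE {name} "
            "(meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT)"
        )
    # Assumption: blank delivery cells are stored as empty strings.
    conn.executemany(
        "INSERT INTO t1 VALUES (?, ?, ?, ?)",
        [(1, 123, "abc", ""), (2, 234, "bcd", "two"), (3, 345, "cde", "")],
    )
    conn.executemany(
        "INSERT INTO t2 VALUES (?, ?, ?, ?)",
        [(1, 345, "cde", ""), (2, 123, "abc", "one"),
         (3, 234, "bcd", "two"), (4, 456, "def", "four")],
    )
    return conn

def rows_only_in_t2(conn):
    # Read column names from the schema and drop the first one (assumed PK).
    cols = [row[1] for row in conn.execute("PRAGMA table_info(t1)")][1:]
    col_list = ", ".join(cols)
    query = f"""
        SELECT {col_list}
        FROM (SELECT 1 AS db, {col_list} FROM t1
              UNION ALL
              SELECT 2 AS db, {col_list} FROM t2) AS q1
        GROUP BY {col_list}
        HAVING SUM(db) = 2
        ORDER BY {col_list}
    """
    return conn.execute(query).fetchall()

conn = make_demo_connection()
result = rows_only_in_t2(conn)
print(result)
```

Because the column list is built from PRAGMA table_info, the same function should also cover a two-column lookup table, where the list shrinks to a single value column.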

You can do different set operations by changing the value in the HAVING clause. SUM(db)=1 returns records in t1 but not in t2; SUM(db)=2 returns records in t2 but not in t1; SUM(db)=1 OR SUM(db)=2 returns records that exist in either but not both; and SUM(db)=3 returns records that exist in both.
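
As a rough illustration of those HAVING variants, the sketch below runs all four against the two-column lookup table from the question. The table names t1 and t2, the TEXT type for baz, and the select_by_having helper are assumptions made for the example, and it relies on each baz value appearing at most once per table (a duplicate inside one table would inflate the sum).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (bar INTEGER PRIMARY KEY, baz TEXT);
    CREATE TABLE t2 (bar INTEGER PRIMARY KEY, baz TEXT);
    INSERT INTO t1 VALUES (1, '123'), (2, '234'), (3, '345');
    INSERT INTO t2 VALUES (1, '234'), (2, '345'), (3, '123'), (4, '456');
""")

def select_by_having(conn, having):
    """Run the tagged UNION ALL query with the given HAVING condition."""
    query = f"""
        SELECT baz
        FROM (SELECT 1 AS db, baz FROM t1
              UNION ALL
              SELECT 2 AS db, baz FROM t2) AS q1
        GROUP BY baz
        HAVING {having}
        ORDER BY baz
    """
    return [row[0] for row in conn.execute(query)]

only_t1 = select_by_having(conn, "SUM(db) = 1")         # in t1, not in t2
only_t2 = select_by_having(conn, "SUM(db) = 2")         # in t2, not in t1
either = select_by_having(conn, "SUM(db) IN (1, 2)")    # in exactly one table
both = select_by_having(conn, "SUM(db) = 3")            # in both tables

print(only_t1, only_t2, either, both)
```

The HAVING text is interpolated directly here only because it comes from the script itself; it should not be built from untrusted input.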

The only thing this doesn't do for you is return the PK. This can't be done in the query I've written because the GROUP BY and SUM operations only work on common / aggregated data, and the PK fields are by definition unique. If you know the combination of non-PK fields is unique within each DB, you could use the returned records to create a new query to find the PK as well.
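
One way that follow-up lookup might look is sketched below. It assumes the non-PK combination is unique within t2 and that no column holds NULL (a plain = comparison never matches NULL); find_rows_with_pk is a hypothetical helper, and the records list is simply the output shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t2 "
    "(meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT)"
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?, ?)",
    [(1, 345, "cde", ""), (2, 123, "abc", "one"),
     (3, 234, "bcd", "two"), (4, 456, "def", "four")],
)

def find_rows_with_pk(conn, records):
    """Look up the full t2 row, PK included, for each (mix, please, delivery)."""
    rows = []
    for record in records:
        row = conn.execute(
            "SELECT meow, mix, please, delivery FROM t2 "
            "WHERE mix = ? AND please = ? AND delivery = ?",
            record,
        ).fetchone()
        if row is not None:
            rows.append(row)
    return rows

# Records returned by the GROUP BY / HAVING SUM(db)=2 query.
records = [(123, "abc", "one"), (456, "def", "four")]
with_pk = find_rows_with_pk(conn, records)
print(with_pk)
```

This issues one query per returned record, which is fine for a handful of differences but may be worth replacing with a join if the diff is large.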

Note this approach extends nicely to more than 2 tables. By making the db field a power of 2, you can operate on any number of tables. E.g. if you used 1 as db for t1, 2 as db for t2, 4 as db for t3, and 8 as db for t4, you could find any intersection / difference of the tables you want by changing the HAVING condition: e.g. HAVING SUM(db)=5 would return records that are in t1 and t3 but not in t2 or t4.
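
A small sketch of the power-of-2 idea with four tables follows. The single val column and the sample contents are invented for the example, and it again assumes each value appears at most once per table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up sample data: which values each table contains.
contents = {
    "t1": ["a", "b", "c", "e"],
    "t2": ["b", "c"],
    "t3": ["a", "c", "d", "e"],
    "t4": ["c"],
}
for name, values in contents.items():
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        f"INSERT INTO {name} (val) VALUES (?)", [(v,) for v in values]
    )

# Tag each table with a distinct power of two: t1 -> 1, t2 -> 2, t3 -> 4, t4 -> 8.
union = " UNION ALL ".join(
    f"SELECT {2 ** i} AS db, val FROM {name}" for i, name in enumerate(contents)
)
query = (
    f"SELECT val FROM ({union}) AS q1 "
    "GROUP BY val HAVING SUM(db) = ? ORDER BY val"
)

# 5 = 1 + 4: present in t1 and t3, absent from t2 and t4.
in_t1_and_t3_only = [row[0] for row in conn.execute(query, (5,))]
# 15 = 1 + 2 + 4 + 8: present in all four tables.
in_all_tables = [row[0] for row in conn.execute(query, (15,))]

print(in_t1_and_t3_only, in_all_tables)
```

Each sum corresponds to exactly one combination of tables, which is why powers of 2 are used rather than 1, 2, 3, 4.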
