Python sqlite3: Diff of table in two databases

I have two databases with identical schemas and I want to effectively do a diff on one of the tables, i.e. return only the unique records, ignoring the primary key.

# Column names are field 1 of each PRAGMA table_info row (works on Python 2 and 3)
columns = [row[1] for row in db1.execute("PRAGMA table_info(foo)")]
db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
data = db2.execute("""
    SELECT 
        two.*
    FROM 
        db1.foo AS one 
        JOIN db2.foo 
        AS two 
    WHERE {}
    """.format(' AND '.join( ['one.{0}!=two.{0}'.format(c) for c in columns[1:]]))
).fetchall()

That is, ignoring the primary key (in this case meow), don't return the records that exist identically in both databases.

The table foo in db1 looks like:

meow    mix    please   deliver
1       123    abc
2       234    bcd      two
3       345    cde

And the table foo in db2 looks like:

meow    mix    please   deliver
1       345    cde
2       123    abc      one
3       234    bcd      two     
4       456    def      four

So the unique entries from db2 are:

[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]

which is what I get. This works great if I have more than two columns. But if there are only two, i.e. a primary key and a value, such as in a lookup table:

db1.foo          db2.foo
bar  baz         bar   baz
1    123         1     234
2    234         2     345
3    345         3     123
                 4     456

I get all non-unique values repeated N-1 times and unique values repeated N times, where N is the number of records in db1. I understand why this is happening, but I don't know how to fix it.

[(1, '234'),
 (1, '234'),
 (2, '345'),
 (2, '345'),
 (3, '123'),
 (3, '123'),
 (4, '456'),
 (4, '456'),
 (4, '456')]
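
The repetition can be reproduced in isolation. The sketch below is a reconstruction, not the original code: it assumes both lookup tables sit in one in-memory database under the hypothetical names one and two, and it selects the rows of the second table, which is what the output above contains. A JOIN without an ON clause pairs every row of two with every row of one, so a value that also exists in one survives the != filter N-1 times and a value that exists only in two survives N times.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE one (bar INTEGER PRIMARY KEY, baz TEXT);
    CREATE TABLE two (bar INTEGER PRIMARY KEY, baz TEXT);
    INSERT INTO one VALUES (1, '123'), (2, '234'), (3, '345');
    INSERT INTO two VALUES (1, '234'), (2, '345'), (3, '123'), (4, '456');
""")

# No ON clause: every row of `two` is paired with every row of `one`,
# and the WHERE clause only drops the pairs whose baz values match.
rows = conn.execute(
    "SELECT two.* FROM one JOIN two WHERE one.baz != two.baz"
).fetchall()

# Count how often each row of `two` survives the filter.
counts = Counter(rows)
for row, n in sorted(counts.items()):
    print(row, n)
```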

One idea I had was to just take the modulus after pulling all the duplicate results:

import itertools

N = db1.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
# Keep only the rows whose repeat count is a multiple of N
data = [
    list(row)
    for row, group in itertools.groupby(sorted(data))
    if len(list(group)) % N == 0
]

Which does work:

[[4, '456']]

But this seems messy and I'd like to do it all in that first SQL query if possible.

Also, on large tables (my real db has ~10k records) this takes a long time. Any way to optimize this? Thanks!

Replacing my earlier answer -- here is a good general solution.

Having input tables that look like this:

sqlite> select * from t1;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           123         abc                   
2           234         bcd         two       
3           345         cde                   

and

sqlite> select * from t2;
meow        mix         please      delivery  
----------  ----------  ----------  ----------
1           345         cde                   
2           123         abc         one       
3           234         bcd         two       
4           456         def         four      

You can get records that are in t2 but not in t1 (ignoring PKs) like this:

select sum(q1.db), mix, please, delivery
from (select 1 as db, mix, please, delivery from t1
      union all
      select 2 as db, mix, please, delivery from t2) q1
group by mix, please, delivery
having sum(db)=2;

sum(q1.db)  mix         please      delivery  
----------  ----------  ----------  ----------
2           123         abc         one       
2           456         def         four      

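You can perform different set operations by changing the value in the HAVING clause. SUM(DB)=1 returns records in t1 but not in t2; SUM(DB)=2 returns records in t2 but not in t1; SUM(DB)=1 OR SUM(DB)=2 returns records that exist in either but not both; and SUM(DB)=3 returns records that exist in both. This relies on each combination of non-PK values appearing at most once per table; otherwise, for example, a record duplicated in t1 would also sum to 2.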
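
The HAVING variants can be tried from Python along these lines. This is a minimal sketch under some assumptions: both tables are created in a single in-memory database rather than in two attached files, the blank delivery cells are stored as NULL (GROUP BY puts NULLs in the same group), and the helper name diff is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT);
    CREATE TABLE t2 (meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT);
""")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?)",
    [(1, 123, "abc", None), (2, 234, "bcd", "two"), (3, 345, "cde", None)],
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?, ?, ?)",
    [(1, 345, "cde", None), (2, 123, "abc", "one"),
     (3, 234, "bcd", "two"), (4, 456, "def", "four")],
)

# Tag each row with its source table, then group on the non-PK columns.
QUERY = """
    SELECT mix, please, delivery
    FROM (SELECT 1 AS db, mix, please, delivery FROM t1
          UNION ALL
          SELECT 2 AS db, mix, please, delivery FROM t2) AS q1
    GROUP BY mix, please, delivery
    HAVING SUM(db) {condition}
    ORDER BY mix, please, delivery
"""

def diff(condition):
    """Hypothetical helper: run the grouped query with one HAVING condition."""
    return conn.execute(QUERY.format(condition=condition)).fetchall()

only_t1 = diff("= 1")        # in t1, not in t2
only_t2 = diff("= 2")        # in t2, not in t1
either = diff("IN (1, 2)")   # in exactly one of the two tables
both = diff("= 3")           # in both tables

print(only_t2)
```

Because the condition string is pasted into the SQL text, this helper is only suitable for fixed, trusted conditions like the ones shown.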

The only thing this doesn't do for you is return the PK. This can't be done in the query I've written because the GROUP BY and SUM operations only work on common / aggregated data, and the PK fields are by definition unique. If you know the combination of non-PK fields is unique within each DB, you could use the returned records to create a new query to find the PK as well.
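
One way to get the PK back in a single statement is to join the grouped result to t2 again. The following is a sketch rather than a tested production query: it assumes the non-PK combination is unique within each table, that blank delivery cells are NULL, and that both tables are in one in-memory database. It uses SQLite's IS operator in the join condition so that NULL values compare as equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT);
    CREATE TABLE t2 (meow INTEGER PRIMARY KEY, mix INTEGER, please TEXT, delivery TEXT);
    INSERT INTO t1 VALUES (1, 123, 'abc', NULL), (2, 234, 'bcd', 'two'),
                          (3, 345, 'cde', NULL);
    INSERT INTO t2 VALUES (1, 345, 'cde', NULL), (2, 123, 'abc', 'one'),
                          (3, 234, 'bcd', 'two'), (4, 456, 'def', 'four');
""")

# Inner query: the grouped diff (records in t2 but not in t1).
# Outer query: match those records back to t2 to pick up the primary key.
rows_with_pk = conn.execute("""
    SELECT t2.meow, t2.mix, t2.please, t2.delivery
    FROM t2
    JOIN (SELECT mix, please, delivery
          FROM (SELECT 1 AS db, mix, please, delivery FROM t1
                UNION ALL
                SELECT 2 AS db, mix, please, delivery FROM t2) AS q1
          GROUP BY mix, please, delivery
          HAVING SUM(db) = 2) AS d
      ON  t2.mix IS d.mix
      AND t2.please IS d.please
      AND t2.delivery IS d.delivery
    ORDER BY t2.meow
""").fetchall()

print(rows_with_pk)  # [(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]
```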

Note that this approach extends nicely to more than two tables. By making the db field a power of 2, you can operate on any number of tables. E.g., if you used 1 as db for t1, 2 as db for t2, 4 as db for t3, and 8 as db for t4, you could find any intersection or difference of the tables by changing the HAVING condition: HAVING SUM(DB)=5 would return records that are in t1 and t3 but not in t2 or t4.
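
The power-of-2 idea could be sketched as follows. The four single-column tables, their contents, and the helper name members are invented for illustration, and the sketch again assumes each value appears at most once per table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data: four tables with one non-PK column each.
contents = {
    "t1": ["a", "b", "c"],
    "t2": ["b", "c", "d"],
    "t3": ["a", "c", "e"],
    "t4": ["c", "f"],
}
for name, values in contents.items():
    conn.execute("CREATE TABLE {} (id INTEGER PRIMARY KEY, val TEXT)".format(name))
    conn.executemany(
        "INSERT INTO {} (val) VALUES (?)".format(name), [(v,) for v in values]
    )

# Give every table its own bit: t1 -> 1, t2 -> 2, t3 -> 4, t4 -> 8.
tables = ["t1", "t2", "t3", "t4"]
flags = {name: 1 << i for i, name in enumerate(tables)}
union = " UNION ALL ".join(
    "SELECT {} AS db, val FROM {}".format(flags[name], name) for name in tables
)

def members(mask):
    """Hypothetical helper: values present in exactly the tables set in mask."""
    sql = (
        "SELECT val FROM ({}) AS q GROUP BY val "
        "HAVING SUM(db) = ? ORDER BY val"
    ).format(union)
    return [row[0] for row in conn.execute(sql, (mask,))]

# 1 + 4 = 5: in t1 and t3, but not in t2 or t4.
in_t1_and_t3_only = members(flags["t1"] | flags["t3"])
print(in_t1_and_t3_only)
```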
