I have two databases with identical schemas, and I want to effectively do a diff on one of the tables, i.e. return only the unique records, ignoring the primary key.
# column names from PRAGMA table_info (wrap zip in list() on Python 3)
columns = list(zip(*db1.execute("PRAGMA table_info(foo)").fetchall()))[1]
db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
data = db2.execute("""
SELECT
one.*
FROM
db1.foo AS one
JOIN db2.foo
AS two
WHERE {}
""".format(' AND '.join( ['one.{0}!=two.{0}'.format(c) for c in columns[1:]]))
).fetchall()
That is, ignoring the primary key (in this case meow), don't return the records that exist identically in both databases.
The table foo in db1 looks like:
meow  mix  please  deliver
1     123  abc
2     234  bcd     two
3     345  cde
And the table foo in db2 looks like:
meow  mix  please  deliver
1     345  cde
2     123  abc     one
3     234  bcd     two
4     456  def     four
So the unique entries from db2 are:
[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]
which is what I get. This works great if there are more than two columns. But if there are only two, i.e. a primary key and a value, such as in a lookup table:
db1:        db2:
bar  baz    bar  baz
1    123    1    234
2    234    2    345
3    345    3    123
            4    456
I get all non-unique values repeated N-1 times and unique values repeated N times, where N is the number of records in db1. I understand why this is happening, but I don't know how to fix it.
[(1, '234'),
(1, '234'),
(2, '345'),
(2, '345'),
(3, '123'),
(3, '123'),
(4, '456'),
(4, '456'),
(4, '456')]
One idea I had was to just take the modulus after pulling all the duplicate results:
import itertools

N = db1.execute("SELECT Count(*) FROM foo").fetchone()[0]
data = [
    list(row)                     # the distinct record
    for row, dupes in itertools.groupby(sorted(data))
    if len(list(dupes)) % N == 0  # unique rows appear a multiple of N times
]
Which does work:
[[4, '456']]
But this seems messy and I'd like to do it all in that first SQL query if possible.
Also, on large tables (my real db has ~10k records) this takes a long time. Any way to optimize this? Thanks!
Replacing my earlier answer -- here is a good general solution.
Having input tables that look like this:
sqlite> select * from t1;
meow        mix         please      delivery
----------  ----------  ----------  ----------
1           123         abc
2           234         bcd         two
3           345         cde
and
sqlite> select * from t2;
meow        mix         please      delivery
----------  ----------  ----------  ----------
1           345         cde
2           123         abc         one
3           234         bcd         two
4           456         def         four
You can get records that are in t2 / not in t1 (ignoring PK's) like this:
select sum(q1.db), mix, please, delivery
from (select 1 as db, mix, please, delivery from t1
      union all
      select 2 as db, mix, please, delivery from t2) q1
group by mix, please, delivery
having sum(db)=2;
sum(q1.db)  mix         please      delivery
----------  ----------  ----------  ----------
2           123         abc         one
2           456         def         four
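If it helps, the same query can be run end-to-end from Python's sqlite3 module against an in-memory copy of the example data (a sketch; the empty delivery cells are stored as NULL here):

```python
import sqlite3

# Recreate the example tables t1 and t2 in an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (meow INTEGER PRIMARY KEY, mix, please, delivery)")
con.execute("CREATE TABLE t2 (meow INTEGER PRIMARY KEY, mix, please, delivery)")
con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", [
    (1, 123, 'abc', None), (2, 234, 'bcd', 'two'), (3, 345, 'cde', None)])
con.executemany("INSERT INTO t2 VALUES (?, ?, ?, ?)", [
    (1, 345, 'cde', None), (2, 123, 'abc', 'one'),
    (3, 234, 'bcd', 'two'), (4, 456, 'def', 'four')])

# Tag each table's rows, stack them with UNION ALL, and keep the groups
# whose tags sum to 2, i.e. rows that appear only in t2.
rows = con.execute("""
    select sum(q1.db), mix, please, delivery
    from (select 1 as db, mix, please, delivery from t1
          union all
          select 2 as db, mix, please, delivery from t2) q1
    group by mix, please, delivery
    having sum(db)=2
""").fetchall()
print(rows)  # the two rows unique to t2
```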
You can do different set operations by changing the value in the having clause: SUM(DB)=1 returns records in t1 / not in t2; SUM(DB)=2 returns records in t2 / not in t1; SUM(DB)=1 OR SUM(DB)=2 returns records that exist in either but not both; and SUM(DB)=3 returns records that exist in both.
The only thing this doesn't do for you is return the PK. That can't be done in the query I've written, because the GROUP BY and SUM operations only work on common / aggregated data, and the PK fields are by definition unique. If you know the combination of non-PK fields is unique within each DB, you could use the returned records in a second query to find the PK as well.
Note this approach extends nicely to more than two tables. By making the db field a power of 2, you can operate on any number of tables. E.g. if you used 1 as db for t1, 2 as db for t2, 4 as db for t3, and 8 as db for t4, you could find any intersection / difference of the tables by changing the having condition -- e.g. HAVING SUM(DB)=5 would return records that are in t1 and t3 but not in t2 or t4.
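To make the power-of-2 scheme concrete, here is a minimal sketch with three one-column tables tagged 1, 2 and 4 (the table names t1..t3 and the column val are illustrative):

```python
import sqlite3

# Three small tables whose rows overlap in different ways.
con = sqlite3.connect(":memory:")
for name in ("t1", "t2", "t3"):
    con.execute("CREATE TABLE %s (val)" % name)
con.executemany("INSERT INTO t1 VALUES (?)", [("a",), ("b",)])
con.executemany("INSERT INTO t2 VALUES (?)", [("b",), ("c",)])
con.executemany("INSERT INTO t3 VALUES (?)", [("a",), ("c",)])

# sum(db)=5 keeps values present in t1 and t3 (1 + 4) but absent from t2.
rows = con.execute("""
    select val, sum(db)
    from (select 1 as db, val from t1
          union all select 2, val from t2
          union all select 4, val from t3) q
    group by val
    having sum(db)=5
""").fetchall()
print(rows)  # -> [('a', 5)]
```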