Python sqlite3 never returns an inner join with 28 million+ rows

Question

I have a SQLite database with two tables, each over 28 million rows long. Here's the schema:

CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);

CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);

Here's an example row from MASTER:

ID          PATH             FILE        FULLPATH                 MODIFIED_TIME
----------  ---------------  ----------  -----------------------  -------------
1           e:\ae/BONDS/0/0  100.bin     e:\ae/BONDS/0/0/100.bin  1213903192.5

The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.

If I execute the following query in sqlite, I get the results I expect:

select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;

That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.

However, if I execute the same query in Python:

changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")

It never returns. I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of CPU or memory; it stays pretty static. The process itself doesn't seem to go unresponsive, but I can't Ctrl-C to break and have to kill the process to stop the script.

I do not have these issues with a small test database - everything runs fine in Python.

I realize this is a large amount of data, but if sqlite is handling the actual queries, Python shouldn't be choking on it, should it? I can do other large queries from Python against this database. For instance, this works:

new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")

Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?

Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.

Answer 1

Python tends to ship with an older version of the SQLite library than the command-line shell, especially Python 2.x, where it is not updated.
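
To see which SQLite library your Python is actually linked against, something like the following should work. It only reads the version constants exposed by the sqlite3 module; you can compare the result with the output of `select sqlite_version();` in the command-line shell.

```python
import sqlite3

# Version of the SQLite C library that this Python build is linked against.
print("SQLite library:", sqlite3.sqlite_version)

# The same version as a tuple of ints, convenient for comparisons.
print("As tuple:", sqlite3.sqlite_version_info)
```

If the two versions differ a lot, the query planner may behave differently in Python than in the shell.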

However, your actual problem is that the query is slow. Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.
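
Here is a minimal sketch of that index, run from Python against a tiny in-memory copy of the schema from the question. The index name idx_inc_fullpath_mtime and the sample rows are made up for illustration; on the real database you would connect to your file instead and skip the table creation and inserts. EXPLAIN QUERY PLAN lets you check whether the planner picks the index up.

```python
import sqlite3

# In-memory stand-in for the real database file (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT, PATH TEXT, FILE TEXT,
                     FULLPATH TEXT, MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT, INC_PATH TEXT,
                          INC_FILE TEXT, INC_FULLPATH TEXT, INC_MODIFIED_TIME FLOAT);
""")

# Made-up sample rows: the second file has a different modification time.
conn.executemany(
    "INSERT INTO MASTER (PATH, FILE, FULLPATH, MODIFIED_TIME) VALUES (?, ?, ?, ?)",
    [("a", "1.bin", "a/1.bin", 100.0), ("a", "2.bin", "a/2.bin", 200.0)],
)
conn.executemany(
    "INSERT INTO INCREMENTAL (INC_PATH, INC_FILE, INC_FULLPATH, INC_MODIFIED_TIME) "
    "VALUES (?, ?, ?, ?)",
    [("a", "1.bin", "a/1.bin", 100.0), ("a", "2.bin", "a/2.bin", 250.0)],
)

# Two-column index on the join column and the compared column.
# The index name is hypothetical; pick whatever fits your naming scheme.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_inc_fullpath_mtime "
    "ON INCREMENTAL (INC_FULLPATH, INC_MODIFIED_TIME)"
)

query = (
    "SELECT ID FROM MASTER INNER JOIN INCREMENTAL "
    "ON FULLPATH = INC_FULLPATH AND MODIFIED_TIME != INC_MODIFIED_TIME"
)

# Ask SQLite how it intends to run the join.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)

# Iterate over the cursor so rows are fetched as they are produced.
changed_ids = [row[0] for row in conn.execute(query)]
print(changed_ids)

conn.close()
```

On 28 million rows, building the index will take a while, but it only has to be done once.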

1 answers

solution1
1 ACCEPTED 2015-03-05 21:35:44