
SQLite database runs very slowly with a very simple query. How can I improve performance?

For a research project I have created an SQLite database that stores news articles. Currently the database is 272 GB in size and stored on a 2 TB cloud volume. My cloud machine has 32 cores and 128 GB of RAM and is attached to this volume.

I am running the following query: "select * from articles where year={} and source in {}", in which I replace the {} placeholders with a year and a set of about six sources.

Running this query takes about an hour and yields about 450k rows (out of 90 million rows in total). While it runs, CPU usage is virtually 0%.

The table was created this way:

    create table if not exists articles(
        source_id TEXT,
        source TEXT,
        day INTEGER,
        month INTEGER,
        year INTEGER,
        program_name TEXT,
        transcript TEXT,
        parliament INTEGER,
        top1_topic INTEGER, top1_acc REAL,
        top2_topic INTEGER, top2_acc REAL,
        top3_topic INTEGER, top3_acc REAL,
        emotionality_nrc REAL,
        emotionality_liwc REAL,
        subject_codes TEXT,
        PRIMARY KEY (source_id, day, month, year, program_name)
    );

and I have indexed source and year separately.

The query plan is:

    QUERY PLAN
    `--SEARCH articles USING INDEX idx_articles_on_year_source (year=? AND source=?)
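The composite index named in that plan corresponds to DDL along these lines (a reconstruction from the plan output; the exact statement isn't shown here, and conn is an open sqlite3 connection):

    # Reconstruction: the index name comes from the EXPLAIN QUERY PLAN output,
    # the column order from the "year=? AND source=?" terms.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_articles_on_year_source "
        "ON articles(year, source)"
    )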

I ran an ioping test in the directory where the database is stored and got:

    --- . (ext4 /dev/vdb) ioping statistics ---
    99 requests completed in 31.1 ms, 396 KiB read, 3.18 k iops, 12.4 MiB/s
    generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
    min/avg/max/mdev = 157.4 us / 314.5 us / 477.6 us / 76.8 us

and the following fio test:

    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

gave this result:

    read: IOPS=10.8k, BW=42.3MiB/s (44.4MB/s)
    write: IOPS=3619, BW=14.1MiB/s (14.8MB/s)

I also tried things like `PRAGMA synchronous=OFF` and different journal modes such as MEMORY and WAL.
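For reference, a minimal sketch of applying those settings through the sqlite3 module (the database path is a placeholder):

    import sqlite3

    conn = sqlite3.connect("articles.db")  # placeholder path

    # Both pragmas mainly affect write/commit behaviour, not read-only queries.
    conn.execute("PRAGMA synchronous=OFF")   # skip fsync on commit
    conn.execute("PRAGMA journal_mode=WAL")  # or MEMORY, as tried above

    # journal_mode reports the mode actually in effect:
    print(conn.execute("PRAGMA journal_mode").fetchone()[0])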

I am a bit lost as to why the database is so slow and what I should do to improve its speed. Have I made a stupid mistake in the setup, or is the infrastructure just not good enough? Should I switch to a data warehouse solution such as Amazon Redshift?

PS: I am connecting to the DB via Python's sqlite3 library and use the following code:

    def select_articles_by_year_and_sources(self, year, sources=None):
        cur = self.conn.cursor()
        # fills the {} placeholders in the query template with the year and sources
        rows = cur.execute(select_articles_by_year_and_sources_query.format(year, sources))
        return iter(ResultIterator(rows))

    conn = db.NewsDb(path_db)  # connect to the database
    articles = list(conn.select_articles_by_year_and_sources(year, sources))
    conn.close()
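As an aside, the same query can be built with bound parameters instead of str.format; a sketch, assuming sources is a non-empty list or tuple of strings and reusing ResultIterator from the code above:

    def select_articles_by_year_and_sources(self, year, sources):
        # one "?" placeholder per source, e.g. "?,?,?,?,?,?" for six sources
        placeholders = ",".join("?" for _ in sources)
        query = "SELECT * FROM articles WHERE year = ? AND source IN ({})".format(placeholders)
        cur = self.conn.cursor()
        return iter(ResultIterator(cur.execute(query, [year, *sources])))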

I just tried copying an 8 GB file from the attached volume to my VM with the bash cp command. It took 2 minutes and 30 seconds, i.e. roughly 55 MB/s. I guess that means the bandwidth to the attached volume is quite slow?

Your query plan shows that the index on the two columns in your WHERE clause, year and source, is already being used, so you might not be able to speed the query up. It's possible, though, that depending on the distribution of your data, an index on articles(source, year) instead of articles(year, source) would do better by pruning out more rows sooner.
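Creating that alternative index is a one-liner (a sketch; the index name is illustrative and conn is an open sqlite3 connection):

    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_articles_on_source_year "
        "ON articles(source, year)"
    )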

You can try adding that new index and then running ANALYZE on the database; this generates statistics about the indexes, which SQLite uses to pick whichever of several candidate indexes it thinks will work best. Check the EXPLAIN QUERY PLAN output afterwards to see whether it is using the new index or still the old one, and then drop whichever index isn't being used (or, if the query is slower in practice with the new index, drop that one).
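A minimal sketch of that check from Python, reusing the illustrative index name from above (the year and source values are placeholders):

    conn.execute("ANALYZE")  # gather index statistics (stored in sqlite_stat1)

    plan = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM articles WHERE year = ? AND source IN (?, ?)",
        (2019, "source_a", "source_b"),
    ).fetchall()
    for row in plan:
        print(row)  # the SEARCH line names the index actually chosen

    # If the old index still wins in practice, drop the losing one, e.g.:
    # conn.execute("DROP INDEX idx_articles_on_source_year")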

Another option is the sqlite3 command line program's .expert command, which generates index suggestions for queries; it's worth seeing what it comes up with in this case.
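For reference, a sketch of what an .expert session looks like in the sqlite3 shell (the comments describe the output shape; they are illustrative, not captured output):

    $ sqlite3 articles.db
    sqlite> .expert
    sqlite> SELECT * FROM articles WHERE year = 2019 AND source IN ('a', 'b');
    -- .expert prints any additional index it recommends for the statement
    -- (a CREATE INDEX line), followed by the query plan that index would enable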
