
SQLite database runs very slowly with a very simple query. How can I improve performance?

For a research project I have created a SQLite database that stores news articles. The database is currently 272GB and is stored on a 2TB cloud volume. My cloud machine has 32 cores and 128GB of RAM and is attached to this volume.

I am running the following query: "select * from articles where year={} and source in {}", in which I replace each '{}' with a year and a list of about 6 sources.

Running this query takes about 1h and yields about 450k rows (out of 90 million total rows). While it runs, CPU usage is virtually 0%.

The table was created this way: "create table if not exists articles(source_id TEXT, source TEXT, day INTEGER, month INTEGER, year INTEGER, program_name TEXT, transcript TEXT, parliament INTEGER, top1_topic INTEGER, top1_acc REAL, top2_topic INTEGER, top2_acc REAL, top3_topic INTEGER, top3_acc REAL, emotionality_nrc REAL, emotionality_liwc REAL, subject_codes TEXT, PRIMARY KEY (source_id, day, month, year, program_name));" and I have indexed source and year separately.

The query explanation is:

QUERY PLAN
`--SEARCH articles USING INDEX idx_articles_on_year_source (year=? AND source=?)

I ran an ioping test in the directory where the database is stored and got:

--- . (ext4 /dev/vdb) ioping statistics ---
99 requests completed in 31.1 ms, 396 KiB read, 3.18 k iops, 12.4 MiB/s
generated 100 requests in 1.65 min, 400 KiB, 1 iops, 4.04 KiB/s
min/avg/max/mdev = 157.4 us / 314.5 us / 477.6 us / 76.8 us

I also ran the following fio test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

which gave this result:

read: IOPS=10.8k, BW=42.3MiB/s (44.4MB/s)
write: IOPS=3619, BW=14.1MiB/s (14.8MB/s)

I also tried things like `PRAGMA synchronous=OFF` and different journal modes such as MEMORY and WAL.

I am a bit lost as to why the database is so slow and what I should do to improve the speed. Have I made a stupid mistake in the setup, or is the infrastructure just not good? Should I switch to a data warehouse solution such as Amazon Redshift?

PS: I am connecting to the DB via Python's sqlite3 library and use the following code:

    def select_articles_by_year_and_sources(self, year, sources=None):
        cur = self.conn.cursor()
        rows = cur.execute(select_articles_by_year_and_sources_query.format(year, sources))

        return iter(ResultIterator(rows))

conn = db.NewsDb(path_db) # connect to database
articles = list(conn.select_articles_by_year_and_sources(year, sources))
conn.close()
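As a side note on the code above: the query string is built with str.format. A minimal sketch of the same selection using sqlite3's bound parameters is shown here. The function signature, the three-column stand-in table, and the sample rows are assumptions for illustration only, not the original NewsDb class, and this is not expected to change the I/O behaviour described above by itself.

```python
import sqlite3

def select_articles_by_year_and_sources(conn, year, sources):
    """Yield matching rows one at a time, binding values instead of formatting them."""
    # One "?" per source; the values are bound by sqlite3, not pasted into the SQL.
    placeholders = ", ".join("?" for _ in sources)
    sql = f"SELECT * FROM articles WHERE year = ? AND source IN ({placeholders})"
    cur = conn.execute(sql, (year, *sources))
    yield from cur  # sqlite3 cursors are iterable, so rows are streamed

# Hypothetical stand-in for the real 272GB database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (source TEXT, year INTEGER, transcript TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [("bbc", 2019, "a"), ("cnn", 2019, "b"), ("bbc", 2020, "c")],
)

rows = list(select_articles_by_year_and_sources(conn, 2019, ["bbc", "cnn"]))
print(len(rows))
conn.close()
```

Iterating over the generator instead of wrapping it in list() would avoid holding all 450k rows, transcripts included, in memory at once.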

I just tried copying an 8GB file from the attached volume to my VM. It took 2m 30s with the bash cp command. I guess that means the bandwidth to the attached volume is quite slow?
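As a rough check of that guess, the copy time can be turned into a throughput figure. This is only arithmetic on the numbers quoted above, and it assumes that "8GB" means 8 GiB.

```python
size_bytes = 8 * 1024**3   # assumption: "8GB" means 8 GiB
elapsed_s = 2 * 60 + 30    # 2m 30s reported for the cp command

throughput_mib_s = size_bytes / elapsed_s / 1024**2
print(f"{throughput_mib_s:.1f} MiB/s")  # prints "54.6 MiB/s"
```

That is a sequential-read figure. It sits in the same range as the 42.3MiB/s random-read bandwidth reported by fio above.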

Your query plan shows that the index on the two columns in your WHERE clause (year and source) is being used, so you might not be able to speed it up. It's possible, though, that depending on the distribution of your data, an index on articles(source, year) might be better than the one on articles(year, source) by pruning out more rows faster.

You can try adding that new index and then running ANALYZE on the database to generate the statistics that SQLite uses to pick which of several possible indexes it thinks will work better. Afterwards, check the EXPLAIN QUERY PLAN output to see whether it's using the new index or still the old one, and then drop whichever index isn't being used (or, if the new index is slower in practice, drop that one).
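The steps above can be sketched with Python's sqlite3 module against a small in-memory stand-in. The reduced table, the generated rows, and the index name idx_articles_on_source_year are assumptions for illustration; only idx_articles_on_year_source comes from the query plan in the question. Which index the planner picks depends on the data, so the sketch inspects the plan and does not assume an answer.

```python
import sqlite3

# Hypothetical miniature of the articles table with only a few columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (source TEXT, year INTEGER, transcript TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(f"src{i % 50}", 2000 + i % 20, "text") for i in range(5000)],
)

# The existing index (name from the question's query plan) and the candidate.
conn.execute("CREATE INDEX idx_articles_on_year_source ON articles(year, source)")
conn.execute("CREATE INDEX idx_articles_on_source_year ON articles(source, year)")

# Gather statistics so the planner can choose between the two indexes.
conn.execute("ANALYZE")

query = "SELECT * FROM articles WHERE year = ? AND source IN (?, ?)"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (2005, "src5", "src25")).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the plan detail text
print(plan_text)

# Indexes not mentioned in the plan are candidates for DROP INDEX.
indexes = [row[1] for row in conn.execute("PRAGMA index_list('articles')")]
unused = [name for name in indexes if name not in plan_text]
print("candidates to drop:", unused)
```

On the real database, building the second index on 90 million rows will take a while and use extra disk space, so it is worth timing the actual query with each index before dropping either one.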

Another option is using the sqlite3 command line program's .expert command, which generates index suggestions for queries, to see what it comes up with in this case.
