Performance issue when using multiple threads with sqlite3

I am writing a program that generates hashes for files in all subdirectories and then either stores them in a database or prints them to standard output: https://github.com/cherrry9/dedup

In the latest commit, I added an option for my program to use multiple threads (the THREADS macro).

Here are some benchmarks that I ran:

$ test() { /usr/bin/time -p ./dedup / -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 8.03
user 4.34
sys 4.55
$ make clean all THREADS=4 && test
real 3.94
user 7.66
sys 7.42

As you can see, the version compiled with THREADS=4 was about 2 times faster.

Now I will use the second positional argument to specify an sqlite3 database:

$ test() { /usr/bin/time -p ./dedup / test.db -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 20.40
user 7.58
sys 7.29
$ rm test.db
$ make clean all THREADS=4 && test
real 21.86
user 17.17
sys 18.15

The version compiled with THREADS=4 was slower than the version compiled with THREADS=1!

When I pass the second argument, this code in dedup.c runs to insert the hashes into the database:

if (sql != NULL && sql_insert(sql, entry->fpath, hash) != 0) {
// ...

sql_insert uses transactions to prevent sqlite from writing to the database every time I call INSERT.

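int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    int errcode;

    /* Serialize access to the shared statement and the insert counter. */
    pthread_mutex_lock(&sql->mtx);
    /* A NULL destructor means SQLITE_STATIC: sqlite3 does not copy the buffers. */
    sqlite3_bind_text(sql->stmt, 1, filename, -1, NULL);
    sqlite3_bind_blob(sql->stmt, 2, hash, SHA256_LENGTH, NULL);

    sqlite3_step(sql->stmt);
    SQL_TRY(sqlite3_reset(sql->stmt));

    /* Every INSERT_LIM inserts, commit the open transaction and begin a new one. */
    if (++sql->insertc >= INSERT_LIM) {
        SQL_TRY(sqlite3_exec(sql->database, "COMMIT;BEGIN", NULL, NULL, NULL));
        sql->insertc = 0;
    }

    pthread_mutex_unlock(&sql->mtx);
    return 0;
}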

This fragment is executed for every processed file, and for some reason it is blocking all threads in my program.

So here's my question: how can I prevent sqlite from blocking threads and degrading the performance of my program?

Here is an explanation of the dedup options, in case you wonder what the test function is doing:

1st positional argument - directory to generate hashes for
2nd positional argument - path to the database that sqlite3 will use
-v level  - verbose level (0 means print only errors)
-c nbytes - read nbytes from each file
-e regex  - exclude directories that match regex

I'm using serialized mode in sqlite3.

It seems that all your threads use the same database connection and statement objects. Therefore you have a race condition (even in the SERIALIZED threading mode), as multiple threads are binding, stepping, and resetting the same statement. Serialized mode makes each individual API call thread-safe, but the bind/step/reset sequence as a whole is not atomic. Asking 'why is it slow' is irrelevant until you fix this problem.

Instead, you should wrap the body of sql_insert in a mutex to guarantee that at most one thread accesses the database connection at a time:

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    pthread_mutex_lock(&sql->mutex);
    // ... actual insert and exec code ...
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}

Then add that mutex to your SQL structure and initialize it with pthread_mutex_init.
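A minimal sketch of that setup, with sqlite3 left out so that it compiles on its own. The SQL structure here is a hypothetical stand-in (the real one also holds the connection and the prepared statement), and sql_init, sql_destroy, sql_insert_stub, run_workers, and the two constants are names made up for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { INSERT_LIM = 1000, INSERTS_PER_THREAD = 10000, MAX_THREADS = 16 };

/* Hypothetical stand-in for the SQL structure. */
typedef struct {
    pthread_mutex_t mutex; /* guards every field below */
    int insertc;           /* inserts since the last commit */
    long total;            /* total inserts, kept only for this demo */
} SQL;

/* Returns 0 on success, like pthread_mutex_init itself. */
int sql_init(SQL *sql)
{
    sql->insertc = 0;
    sql->total = 0;
    return pthread_mutex_init(&sql->mutex, NULL);
}

void sql_destroy(SQL *sql)
{
    pthread_mutex_destroy(&sql->mutex);
}

/* Stand-in for sql_insert: the bind/step/reset calls would sit
 * between lock and unlock, next to the counter updates. */
int sql_insert_stub(SQL *sql)
{
    pthread_mutex_lock(&sql->mutex);
    sql->total++;
    if (++sql->insertc >= INSERT_LIM)
        sql->insertc = 0; /* the real code would run COMMIT;BEGIN here */
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}

static void *worker(void *arg)
{
    SQL *sql = arg;
    for (int i = 0; i < INSERTS_PER_THREAD; i++)
        sql_insert_stub(sql);
    return NULL;
}

/* Starts nthreads workers sharing one SQL object and returns the
 * number of inserts recorded, or -1 if the mutex cannot be created. */
long run_workers(int nthreads)
{
    SQL sql;
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    if (sql_init(&sql) != 0)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &sql) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    long total = sql.total;
    sql_destroy(&sql);
    return total;
}
```

Calling run_workers(4) from main should report every insert from every thread. Without the lock around the counter updates, some increments could be lost, which is the same kind of race as sharing one statement object between threads.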

You'll see a performance boost if your bottleneck is indeed the computation of SHA-256 rather than writing to the database. Otherwise, the overhead of this mutex should be negligible and the number of threads will not have a significant effect on the run time.
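To put rough numbers on this, Amdahl's law gives an upper bound on what extra threads can buy. The sketch below assumes that the single-threaded run without a database (8.03 s) is the parallelizable hashing part of the single-threaded run with a database (20.40 s), that the rest is fully serialized behind the mutex, and that hashing scales ideally. None of these assumptions is guaranteed to hold, and the function names are made up for illustration:

```c
#include <assert.h>

/* Amdahl's law: p is the fraction of the run time that can run in
 * parallel, n is the number of threads. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Best-case wall time with n threads, given single-threaded timings:
 * parallel_seconds is the part that can be spread across threads,
 * total_seconds is the whole run including the serialized part. */
double estimated_time(double parallel_seconds, double total_seconds, int n)
{
    assert(parallel_seconds >= 0.0 && total_seconds > 0.0);
    assert(parallel_seconds <= total_seconds);
    double p = parallel_seconds / total_seconds;
    return total_seconds / amdahl_speedup(p, n);
}
```

With the timings from the question, estimated_time(8.03, 20.40, 4) comes out at about 14.4 s, a speedup of roughly 1.4x at best. About 12.4 s of the run is serialized database work that no number of threads can shorten, so the database, not the hashing, dominates the run time here.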
