
Speeding up a postgresql query with large data sets

I have a postgresql database which has 2 tables that I am interested in at the moment. The first table is my "file" table, which contains a file name, some relevant information about that file, and has a serial id as its primary key. Here is a rough outline of my file table:

fileData(fileName varchar(120) unique, ... other info, id serial primary key)

I then have another table that contains information from the files listed in the file table. It is linked to the file table through the id of the file table. There is a variable number of lines in the "data" table corresponding to each file, with the line numbers varying from several hundred to several hundred thousand. Here is a rough outline of my data table:

rawData(fileID integer references fileData(id), lineNum integer, data1 double, ... other info)

To go with the above, I have a query where I first sort through the fileData to get the id of each file, as well as some of the other info. Then I look through the raw data corresponding to that file to find "interesting" information. This particular query is written in C++ using Qt to handle the actual processing, but the majority of the work is being done by the database (Qt just passes the query in as text, so the query needs to match all of the formatting that the sql database would normally need). Below is an example of my query:

QSqlQuery fileQuery, dataQuery;
int id;
fileQuery.prepare("SELECT id, fileType FROM fileData ORDER BY id");
if (!fileQuery.exec()){
    //error
    return;
}
while (fileQuery.next()){
    id = fileQuery.value(0).toInt();
    dataQuery.prepare("SELECT lineNum, data1, ...other info "
                      "FROM rawData WHERE fileID = ? and data1 < ? "
                      "ORDER BY fileID, lineNum");
    dataQuery.addBindValue(id);
    dataQuery.addBindValue(num);
    if (!dataQuery.exec()){
        return;
    }
    while (dataQuery.next()){
      //code to load pertinent info into my program to handle later
    }
}

This program took about 2 hours or so to run up until recently, with 1400 files loaded and about a million or so lines of data. However, I just got a bunch more data, and am now up to 1650 files of data with 130 million lines of data, and my program has slowed to a crawl. What used to take two hours has now taken over 6 hours to go through only 1/4 of the files that I now have, and my debug output tells me that I am still working through files that I have run this program on previously, not any new data yet. Checking my task manager, I can see that my program is barely working, while postgresql is using an entire core to produce the data I am asking for, so I know that the current hold-up is in my sql commands, not in what I am doing with the data in the meantime.

Lastly, at the moment, throwing more hardware at the problem is not something that I can do. With that being said, is there anything that I can do to optimize my queries to increase the speed at which I am accessing this data? Or am I already doing things correctly, and just have to suck it up and deal with the slowness due to the size of the data set that I am working with?

You probably only need to execute each query once.

1) The file table is so small that you can load it into an in-memory map and be done with it

2) The query on the data table, filtered by data1 and ordered by file id, should not take ages (of course you have an index on fileID + lineNum, right?)
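For reference, a composite index matching that filter-and-sort pattern could be created as follows (table and column names are taken from the question; the literal values 42 and 100 are placeholders):

```sql
-- Composite index so Postgres can locate one file's rows and
-- return them already sorted by (fileID, lineNum):
CREATE INDEX rawdata_fileid_linenum_idx ON rawData (fileID, lineNum);

-- Check that the planner actually uses it for a representative query:
EXPLAIN ANALYZE
SELECT lineNum, data1
FROM rawData
WHERE fileID = 42 AND data1 < 100
ORDER BY fileID, lineNum;
```

If EXPLAIN still shows a sequential scan plus an explicit sort over 130 million rows, that alone would account for the slowdown.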

Is there any reason not to combine the two queries into one?

SELECT id, fileType, lineNum, data1, ...other info 
FROM fileData LEFT JOIN rawData on fileData.id = rawData.fileID
WHERE data1 < ? 
ORDER BY fileID, lineNum

Also since you say num is a constant in the function, instead of binding it to a replaceable parameter, I'd just construct the query string with its value. Making sure you have the right indexes on both tables is imperative also.
