
Improving performance for simple left join in postgreSQL

I am trying to do a left join between two tables in a postgreSQL database and finding it takes about 14 minutes to run. From existing SO posts, it seems like this type of join should be on the order of seconds, so I'd like to know how to improve the performance of this join. I'm running 64-bit postgreSQL version 9.4.4 on a Windows 8 machine with 8 GB RAM, using pgAdmin III. The table structures are as follows:

Table A: "parcels_qtr":

parcel (text) | yr (int) | qtr (text) | lpid (pk, text)

Has 15.5 million rows, each column is indexed, and "lpid" is the primary key. I also ran this table through a standard vacuum process.

Table B: "postalvac_qtr":

parcel (text) | yr (int) | qtr (text) | lpid (pk, text) | vacCountY (int)

Has 618,000 records, all fields except "vacCountY" are indexed, and "lpid" is the primary key. This table has also gone through a standard vacuum process.

When running with data output, it takes about 14 min. When running with explain (analyze, buffers) it takes a little over a minute. First question - is this difference in performance wholly attributable to printing the data, or is something else going on here?

And second question, can I get this run time down to a few seconds?

Here is my SQL code:

EXPLAIN (ANALYZE, BUFFERS)
select a.parcel,
   a.lpid,
   a.yr,
   a.qtr,
   b."vacCountY"
from parcels_qtr as a
left join postalvac_qtr as b
on a.lpid = b.lpid;

And here are the results of my explain statement: https://explain.depesz.com/s/uKkK

I'm pretty new to postgreSQL, so patience and explanations would be greatly appreciated!

You're asking the DB to do quite a bit of work. Just looking at the explain plan, it's:

  1. Read in an entire table (postalvac_qtr)
  2. Build a hash based on lpid
  3. Read in an entire other, much larger, table (parcels_qtr)
  4. Hash each of the 15MM lpids, and match them to the existing hash table

How large are these tables? You can check this by issuing:

SELECT pg_size_pretty(pg_relation_size('parcels_qtr'));
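That query reports the heap alone; to get a fuller picture, a sketch that also counts indexes and TOAST data for both tables (assuming the table names from the question):

```sql
-- Heap plus all indexes and TOAST data, for both tables:
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('parcels_qtr', 'postalvac_qtr');
```

Comparing these totals against your RAM gives a rough sense of whether the hash join can possibly stay in memory.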

I'm almost certain that this hash join is spilling out to disk, and given the way the query is structured ("give me all of the data from both of these tables"), there's no way it won't.

The indexes don't help, and can't. As long as you're asking for the entirety of a table, using an index would only make things slower -- postgres has to traverse the entire table anyway, so it might as well issue a sequential scan.

As for why the query has different performance than the explain analyze, I suspect you're correct. A combination of (1) sending 15M rows to your client and (2) trying to display them is going to cause a significant slowdown above and beyond the actual query.
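One way to test that theory: if the real goal is a file export rather than an on-screen grid, stream the result straight to disk and skip client-side rendering entirely. A sketch using psql's \copy (the output file name here is hypothetical):

```sql
-- In psql, write the join result to a client-side CSV; nothing is
-- rendered in a result grid, so display overhead disappears:
\copy (select a.parcel, a.lpid, a.yr, a.qtr, b."vacCountY" from parcels_qtr as a left join postalvac_qtr as b on a.lpid = b.lpid) to 'join_output.csv' with (format csv, header)
```

If this finishes in roughly the explain analyze time, the extra 13 minutes were pgAdmin's display, not the query.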

So, what can you do about it?

First, what is this query trying to do? How often do you want to grab all of the data in those two tables, completely unfiltered? If it's very common, you may want to consider going back to the requirements stage and figuring out another way to address that need (eg would it be reasonable to grab all the data for a given year and quarter instead?). If it's uncommon (say, a daily export), then 1-14 min might be fine.
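For instance, if a single year and quarter is usually enough, a filtered version of the join can actually use the indexes on yr and qtr instead of scanning everything (the filter values below are placeholders; I don't know what format your qtr column uses):

```sql
select a.parcel,
       a.lpid,
       a.yr,
       a.qtr,
       b."vacCountY"
from parcels_qtr as a
left join postalvac_qtr as b
  on a.lpid = b.lpid
where a.yr = 2015        -- placeholder year
  and a.qtr = 'Q1';      -- placeholder: match your qtr values
```

A filter like this shrinks the driving table from 15.5M rows to one quarter's worth, which is where index scans start to pay off.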

Second, you should make sure that your tables aren't bloated. If your tables see significant update or delete traffic, they can grow over time. The autovacuum daemon is there to help deal with this, but occasionally issuing a vacuum full will also help.
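A quick way to gauge bloat is to compare live and dead tuple counts before deciding to vacuum; note that VACUUM FULL rewrites the table under an exclusive lock, so run it during a quiet window. A sketch:

```sql
-- Dead tuples accumulate from updates/deletes and inflate the heap:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('parcels_qtr', 'postalvac_qtr');

-- If n_dead_tup is large, reclaim the space (takes an exclusive lock):
VACUUM (FULL, ANALYZE) parcels_qtr;
```

The ANALYZE option also refreshes planner statistics, which matters after any large data change.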

Third, you can try tuning your DB config. In postgresql.conf, there are parameters for things like the expected amount of RAM that your server can use for disk cache, and the amount of RAM the server can use for sorting or joining (before it spills out to disk). By tinkering with these sorts of parameters, you might be able to improve the speed.
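The settings most relevant here are work_mem (memory per sort/hash operation, so a spilling hash join benefits directly) and effective_cache_size (the planner's estimate of available OS disk cache). The values below are only illustrative for an 8 GB machine, not recommendations:

```sql
-- Illustrative values for ~8 GB RAM; tune to your workload.
-- work_mem can also be raised per-session, which is safer than a
-- global change since it applies per operation, not per connection:
SET work_mem = '256MB';

-- Equivalent postgresql.conf entries (require a reload/restart):
--   shared_buffers = 1GB
--   work_mem = 64MB
--   effective_cache_size = 4GB
```

Rerunning explain (analyze, buffers) after a per-session SET work_mem will show whether the hash join stops spilling ("Batches: 1" instead of many).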

Fourth, you might want to revisit your schema. Do you want year and quarter as two separate columns, or would you be better off with a single column of the date type? Do you want a text key, or would you be better off with a bigint (either serial or derived from the text column), which will likely join more quickly? Are the parcel, yr, and qtr fields actually needed in both tables, or are they duplicate data in one table?
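A sketch of the bigint-key idea, using a lookup table to map the text lpid to a compact surrogate (the lpid_lookup table and lpid_id column are hypothetical names, not part of your schema):

```sql
-- Map each distinct text lpid to a bigint surrogate once; both fact
-- tables would then store and join on lpid_id instead of the text key:
create table lpid_lookup (
    lpid_id bigserial primary key,
    lpid    text not null unique
);

insert into lpid_lookup (lpid)
select distinct lpid from parcels_qtr;
```

Hashing and comparing 8-byte integers is cheaper than doing the same on arbitrary-length text, which is where the join speedup would come from.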

Anyway, I hope this helps.
