
psycopg2 leaking memory after large query

I'm running a large query in a Python script against my Postgres database using psycopg2 (I upgraded to version 2.5). After the query finishes, I close the cursor and the connection, and even run gc, but the process still consumes a huge amount of memory (7.3 GB, to be exact). Am I missing a cleanup step?

import gc
import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
gc.collect()

I ran into a similar problem, and after a couple of hours of blood, sweat and tears found that the answer simply requires the addition of one parameter.

Instead of

cursor = conn.cursor()

write

cursor = conn.cursor(name="my_cursor_name")

or simpler yet

cursor = conn.cursor("my_cursor_name")

The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

I found the instructions a little confusing, in that I thought I'd need to rewrite my SQL to include a "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.

The suggestion above of using fetchone or fetchmany doesn't resolve the problem because, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides declaring a name parameter) is to change the cursor.itersize attribute from its default of 2000 down to, say, 1000 if you still have too little memory.

Please see the next answer by @joeblog for the better solution.


First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more memory-efficient cursor.fetchmany method. See the psycopg2 documentation.
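As a minimal sketch of the fetchmany pattern: the iter_chunks helper below is illustrative, not part of psycopg2's API, and the stub cursor merely stands in for a real one so the loop can be shown end to end.

```python
def iter_chunks(cursor, size=2000):
    """Yield rows from a DB-API cursor one batch at a time,
    so only `size` rows are ever held in memory at once."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield from rows

# Stub standing in for a psycopg2 cursor, purely for illustration.
class _FakeCursor:
    def __init__(self, rows):
        self._rows, self._pos = rows, 0
    def fetchmany(self, size):
        batch = self._rows[self._pos:self._pos + size]
        self._pos += len(batch)
        return batch

fetched = list(iter_chunks(_FakeCursor(list(range(10))), size=3))
# With a real psycopg2 cursor you would instead do something like:
#   cursor.execute("SELECT * FROM big_table")
#   for row in iter_chunks(cursor):
#       process(row)
```

Note that with an ordinary (client-side) cursor, libpq has already buffered the full result set before fetchmany runs; the answer further down explains why a named cursor is needed to avoid that.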

Now, the explanation for why the memory isn't freed, and why that isn't a memory leak in the formally correct use of the term.

Most processes don't release memory back to the OS when it's freed; they just make it available for re-use elsewhere in the program.

Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to it. Indirect references are rather inefficient, especially on modern CPUs, where chasing pointers around does horrible things to performance.

What usually ends up happening, unless the program exercises extra caution, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.

The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory, the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.

It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
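You can watch this effect from Python itself on Unix with the standard resource module: ru_maxrss is the process's peak resident set size, a high-water mark that never drops, so it stays large even after the big object is deleted and garbage-collected. A rough demonstration (exact numbers and units vary by platform, KB on Linux, bytes on macOS):

```python
import gc
import resource

def peak_rss():
    # Peak resident set size of this process (a monotonic high-water mark).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
big = list(range(5_000_000))   # allocate a couple hundred MB of ints
during = peak_rss()
del big                        # freed from Python's point of view...
gc.collect()
after = peak_rss()

print(before, during, after)   # `after` stays at the high-water mark
```

The interpreter can reuse that freed memory for later allocations, which is exactly the "not actually a leak" point below.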


So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it'll re-use the previously freed memory from the last big allocation.

Joeblog has the correct answer. The way you deal with the fetching is important, but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.

import sys

import psycopg2

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')  # named cursor => server-side
curPG.itersize = 100000  # rows fetched at one time from the server

curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS for a named cursor!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # to see the progression
conPG.commit()  # also closes the server-side cursor
conPG.close()

As you will see, the dots come in rapid groups, then pause while the next buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.
