
Python-requests not clearing memory when downloading with sessions

I have an application where I use requests to download .mp3 files from a server.

The code looks like this:

self.client = requests.session(headers={'User-Agent': self.useragent})

def download(self, url, name):
    request = self.client.get(url)

    with open(name, "wb") as code:
        code.write(request.content)

    print "done"

The problem is that when the download is finished, Python does not clear the memory, so every time I download an mp3, the memory usage of the application rises by the size of the mp3. The memory does not get cleared again, leading to my app using a lot of memory.

I assume this has to do with how I save the file, or with how requests.session works.

Any suggestions?

Edit: Here is the code: https://github.com/Simon1988/VK-Downloader

The relevant part is in lib/vklib.py.

You could try streaming the content in chunks:

def download(self, url, name):
    request = self.client.get(url, stream=True)  # `prefetch=False` for older
                                                 # versions of requests
    with open(name, "wb") as code:
        for chunk in request.iter_content(1024):
            if not chunk:
                break

            code.write(chunk)

I don't think there's an actual problem here, beyond you not understanding how memory allocation works.

When Python needs more memory, it asks the OS for more. When it's done with that memory, it generally does not return it to the OS; instead, it holds onto it for later objects.

So, when you open the first 10MB mp3, your memory use goes from, say, 3MB to 13MB. Then you free up that memory, but you're still at 13MB. Then you open a second 10MB mp3, but it reuses the same memory, so you're still at 13MB. And so on.

In your code, you're creating a thread for each download. If you have 5 threads at a time, all using 10MB, obviously that means you're using 50MB. And that 50MB won't be released. But if you wait for them to finish, then do another 5 downloads, it'll reuse the same 50MB again.

Since your code doesn't limit the number of threads in any way, there's nothing (short of CPU speed and context-switching costs) to stop you from kicking off hundreds of threads, each using 10MB, meaning gigabytes of RAM. But just switching to a thread pool, or not letting the user kick off more downloads if too many are going on, etc., will solve that.
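For example, here's a quick untested sketch of capping the number of simultaneous downloads with concurrent.futures (built into Python 3, or the futures backport on Python 2.7); download_all, downloader, and the (url, name) job list are just illustrative names, not code from your repo:

from concurrent.futures import ThreadPoolExecutor

def download_all(downloader, jobs, max_workers=5):
    # At most max_workers downloads run at once, so peak memory stays
    # around max_workers * (largest file) instead of growing with the
    # number of queued downloads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(downloader.download, url, name)
                   for url, name in jobs]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread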

So, usually, this is not a problem. But if it is, there are two ways around it:

  1. Create a child process (e.g., via the multiprocessing module) to do the memory-hogging work. On any modern OS, when a process goes away, its memory is reclaimed. The problem here is that allocating and releasing 10MB over and over again is actually going to slow your system down, not speed it up, and the cost of process startup (especially on Windows) will make it even worse. So, you'll probably want to hand a much larger batch of jobs off to a child process (see the sketch after this list).

  2. Don't read the whole thing into memory at once; use a streaming API instead of a whole-file API. With requests, this means setting stream=True in the initial request, and then usually using r.raw.read(8192), r.iter_content(), or r.iter_lines() in a loop instead of accessing r.content, as in the snippet earlier in this answer.
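For option 1, here's a rough untested sketch of shipping a whole batch of downloads off to a child process with multiprocessing; fetch_batch, download_in_child, and the jobs list are made-up names for illustration, not anything from the linked repo:

import multiprocessing
import requests

def fetch_batch(jobs):
    # Runs in the child process; whatever memory it allocates is handed
    # back to the OS when the process exits.
    session = requests.session()
    for url, name in jobs:
        r = session.get(url)
        with open(name, "wb") as f:
            f.write(r.content)

def download_in_child(jobs):
    # On Windows, call this from under `if __name__ == '__main__':`,
    # since child processes are spawned rather than forked there.
    worker = multiprocessing.Process(target=fetch_batch, args=(jobs,))
    worker.start()
    worker.join()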
