How to reduce memory usage of threaded Python code?

I wrote about 50 classes that I use to connect to and work with websites using mechanize and threading. They all run concurrently, but they don't depend on each other, so that means 1 class - 1 website - 1 thread. It's not a particularly elegant solution, especially for managing the code, since a lot of the code repeats in each class (but not nearly enough to merge it into one class that takes arguments, as some sites may require additional processing of retrieved data in the middle of methods, like 'login', that others might not need). As I said, it's not elegant, but it works. Needless to say, I welcome all recommendations on how to write this better without the one-class-per-website approach. Adding functionality to each class, or managing the code overall, is a daunting task.

However, I found out that each thread takes about 8MB of memory, so with 50 running threads we are looking at about 400MB of usage. If it were running on my own system I wouldn't have a problem with that, but since it's running on a VPS with only 1GB of memory, it's starting to be an issue. Can you tell me how to reduce the memory usage, or is there any other way to work with multiple sites concurrently?

I used this quick test Python program to check whether it's the data stored in my application's variables that is using the memory, or something else. As you can see in the following code, it only calls sleep(), yet each thread uses 8MB of memory.

from thread import start_new_thread
from time import sleep

def sleeper():
    try:
        while 1:
            sleep(10000)
    except:
        if running: raise

def test():
    global running
    n = 0
    running = True
    try:
        while 1:
            start_new_thread(sleeper, ())
            n += 1
            if not (n % 50):
                print n
    except Exception, e:
        running = False
        print 'Exception raised:', e
    print 'Biggest number of threads:', n

if __name__ == '__main__':
    test()

When I run this, the output is:

50
100
150
Exception raised: can't start new thread
Biggest number of threads: 188

And by removing the running = False line, I can then measure free memory using the free -m command in a shell:

             total       used       free     shared    buffers     cached
Mem:          1536       1533          2          0          0          0
-/+ buffers/cache:       1533          2
Swap:            0          0          0

The calculation behind the 8MB-per-thread figure is simple: I take the difference between the memory used before and while the above test application is running, and divide it by the maximum number of threads it managed to start.

It's probably only allocated memory, because according to top, the python process uses only about 0.6% of memory.
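One possible explanation, which is an assumption and not something confirmed in this post: on many Linux systems each new thread reserves a stack whose size follows the `ulimit -s` setting, commonly 8 MB, which would fit both the per-thread figure and the low usage reported by top. If that is the cause, a smaller per-thread stack can be requested with `threading.stack_size()` before the threads are created. A minimal Python 3 sketch, where the 256 KiB value and the `sleeper` helper are illustrative choices only:

```python
import threading

# Assumption: 256 KiB of stack is enough for the work each thread does.
# The size must be at least 32 KiB; some platforms reject a value
# (ValueError) or do not support changing it at all (RuntimeError).
threading.stack_size(256 * 1024)

def sleeper(stop):
    # Hypothetical stand-in for a thread that mostly waits.
    stop.wait()

stop = threading.Event()
threads = [threading.Thread(target=sleeper, args=(stop,)) for _ in range(50)]
for t in threads:
    t.start()

alive = sum(t.is_alive() for t in threads)
print(alive, "threads running with a reduced stack size")

# Release the threads and wait for them to finish.
stop.set()
for t in threads:
    t.join()
```

Threads that recurse deeply or call into C code with large stack frames may need more than this, so the value would have to be tuned by experiment.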

Using "one thread per request" is OK and easy for many use cases. However, it requires a lot of resources (as you experienced).

A better approach is to use an asynchronous one, but unfortunately it is a lot more complex.

Some hints in this direction:

The solution is to replace code like this:

1) Do something.
2) Wait for something to happen.
3) Do something else.

With code like this:

1) Do something.
2) Arrange it so that when something happens, something else gets done.
3) Done.

Somewhere else, you have a few threads that do this:

1) Wait for anything to happen.
2) Handle whatever happened.
3) Go to step 1.

In the first case, if you're waiting for 50 things to happen, you have 50 threads sitting around waiting for 50 things to happen. In the second case, you have one thread waiting around that will do whichever of those 50 things needs to get done.

So, don't use a thread to wait for a single thing to happen. Instead, arrange it so that when that thing happens, some other thread will do whatever needs to get done next.
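The "one thread waits for anything to happen" pattern can be sketched in Python 3 with the standard asyncio event loop. This is only an illustration under stated assumptions: `visit_site` is a made-up placeholder, and `asyncio.sleep` stands in for waiting on the network. A blocking HTTP library would not work unchanged in this style, so the real requests would need an asynchronous client.

```python
import asyncio
import threading

async def visit_site(name, delay):
    # Hypothetical stand-in for one website session: the sleep represents
    # waiting for a response without blocking the thread.
    await asyncio.sleep(delay)
    return name, threading.current_thread().name

async def main(site_count=50):
    # One coroutine per site; all of them share the single event-loop thread,
    # which resumes whichever coroutine has something ready.
    jobs = [visit_site("site-%d" % i, 0.01) for i in range(site_count)]
    return await asyncio.gather(*jobs)

results = asyncio.run(main())
thread_names = {thread for _, thread in results}
print(len(results), "sites handled by", len(thread_names), "thread")
```

Here 50 waits are in flight at once, but only one thread, and therefore only one thread stack, exists.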

I'm no expert on Python, but you could have a few thread pools that control the total number of active threads and hand off a 'request' to a thread once it's done with the previous one. The request doesn't have to be a full thread object, just enough data to complete whatever the request is.
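In Python 3 this idea maps onto `concurrent.futures.ThreadPoolExecutor` from the standard library. The sketch below is illustrative only: `handle_request`, the site names, and the pool size of 5 are assumptions, and each "request" is just a string handed to whichever worker is free.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

def handle_request(site):
    # Hypothetical placeholder for the real per-site work (login, fetch, parse).
    time.sleep(0.01)
    return site, threading.current_thread().name

sites = ["site-%d" % i for i in range(50)]  # made-up request data

# At most 5 worker threads exist, no matter how many requests are queued.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(handle_request, sites))

worker_names = {name for _, name in results}
print(len(results), "requests handled by", len(worker_names), "worker threads")
```

The memory cost then scales with the pool size rather than with the number of websites.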

You could also structure it so that thread pool A has N threads pinging the websites; once the data is retrieved, it is handed off to thread pool B, which has Y threads crunching the data.
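A rough Python 3 sketch of that two-pool layout, again using `concurrent.futures`. The `fetch` and `crunch` functions, the site names, and the pool sizes (N = 4, Y = 2) are all hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(site):
    # Hypothetical stage A: stands in for the network request.
    time.sleep(0.01)
    return site, "raw data from " + site

def crunch(item):
    # Hypothetical stage B: stands in for processing the retrieved data.
    site, raw = item
    return site, len(raw)

sites = ["site-%d" % i for i in range(20)]  # made-up site list

with ThreadPoolExecutor(max_workers=4) as pool_a, \
     ThreadPoolExecutor(max_workers=2) as pool_b:
    fetched = [pool_a.submit(fetch, site) for site in sites]
    # Each result is handed to pool B as soon as its fetch completes.
    crunched = [pool_b.submit(crunch, done.result())
                for done in as_completed(fetched)]
    results = dict(future.result() for future in crunched)

print(len(results), "sites fetched and processed")
```

Sizing the two pools separately lets the number of waiting (network) threads and the number of working (processing) threads be tuned independently.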
