

Python multiprocessing - one process taking huge memory

I am trying to crawl through multiple web pages and gather data.

import multiprocessing

pool = multiprocessing.Pool(4, maxtasksperchild=1000)
ret = pool.map(get_data_for_something, some_list)  # ret is not used afterwards

Each process in turn creates more threads (using the threading API). For example, if a web page is paginated, threads are created to access each page (URL) at the same time.
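For illustration, here is a minimal sketch of what such a worker could look like, assuming hypothetical download_page() and page_urls_for() helpers in place of the real scraping code (none of these names come from the original post):

import threading

def get_data_for_something(item):
    """Illustrative worker: one thread per paginated URL of `item`."""
    results = []
    lock = threading.Lock()

    def fetch(url):
        data = download_page(url)      # hypothetical: the actual HTTP/scraping call
        with lock:                     # protect the shared list across threads
            results.append(data)

    urls = page_urls_for(item)         # hypothetical: enumerate the paged URLs
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results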

All the processes collect data and dump it into a CSV file (pandas is used). Each individual CSV file is no more than 500KB.

try:
    dt = get_data_from_wb1(id, start=start, end=end)
    nsdf = get_data_from_wb2(id, start=start, end=end)
    if not nsdf.empty:
        nsdf.drop("Label", axis=1, inplace=True)
        nsdf.insert(0, "some_label", nsdf.index)
        nsdf.insert(0, "id", id)
        nsdf.columns = dbcols
        nsdf["label_new"] = dt["label_new"]
        nsdf.to_csv(path + variable + ".csv")
    else:
        raise Exception("returned null")
except Exception as e:
    logger = logger_map.get(multiprocessing.current_process().name, setup_logger())
    logger.error(variable + " : " + variable2 + " : " + str(e.args[0]))

The above code shows what each process does; more threads are created inside the "get_data_" functions.

I am on Windows with a quad-core Core i7. So should I spawn 3 processes or 4, given that one of them is the main process?
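As a rough sketch (not from the original post), one common rule of thumb is to size the pool from cpu_count() and leave one core for the main process; for I/O-bound crawling the exact number matters much less than for CPU-bound work:

import multiprocessing

# Leave one core for the main process; never go below one worker.
n_workers = max(1, multiprocessing.cpu_count() - 1)
pool = multiprocessing.Pool(n_workers, maxtasksperchild=1000)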

MAIN question: one of the spawned processes takes up huge memory (5GB) whereas the others take around 100-200MB. Why is this happening?

I cannot put the full code here, so please don't downvote the question. But can somebody point me in the right direction as to why one process ends up taking so much memory?

You will have to make the worker processes produce some debugging output to be able to answer your question.

Use e.g. the logging module to document when threads are started and finished, how many URLs they find, and how long processing of a URL takes. That output will probably lead you to further questions.
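A minimal sketch of such instrumentation, assuming a hypothetical download_page() call; the %(processName)s and %(threadName)s fields in the format string show which worker process and thread produced each line:

import logging
import time

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(processName)s %(threadName)s %(message)s",
)
log = logging.getLogger(__name__)

def fetch(url):
    log.debug("start %s", url)
    t0 = time.time()
    data = download_page(url)   # hypothetical scraping call
    log.debug("done %s in %.1fs (%d bytes)", url, time.time() - t0, len(data))
    return data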

Maybe some pages contain links to themselves, sending your program into an infinite loop; a simple guard against that is sketched below.
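A minimal per-process guard, again with hypothetical download_page() and extract_links() helpers, is to remember which URLs have already been visited:

seen = set()   # per-process set of URLs already crawled

def crawl(url):
    if url in seen:      # skip pages we have already visited
        return
    seen.add(url)
    for link in extract_links(download_page(url)):   # hypothetical helpers
        crawl(link)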
