
Python multiprocessing - one process taking huge memory

I am trying to crawl through multiple web pages and gather data.

pool = multiprocessing.Pool(4, maxtasksperchild=1000)
ret = pool.map(get_data_for_somthing, some_list) #ret not useful

Each process in turn creates more threads (using the threading API). For example, if a web page is paginated, threads are created to fetch each page (URL) at the same time.

All the processes collect data and dump it to CSV files (pandas is used). An individual CSV file is no more than 500 KB.

try:
    dt = get_data_from_wb1(id, start=start, end=end)
    nsdf = get_data_from_wb2(id, start=start, end=end)
    if not nsdf.empty:
        nsdf.drop("Label", axis=1, inplace=True)
        nsdf.insert(0, "some_label", nsdf.index)
        nsdf.insert(0, "id", id)
        nsdf.columns = dbcols
        nsdf["label_new"] = dt["label_new"]
        # write this worker's result to its own CSV file
        nsdf.to_csv(path + variable + ".csv")
    else:
        raise Exception("returned null")
except Exception as e:
    # log the failure with the logger that belongs to the current process
    logger_map.get(multiprocessing.current_process().name, setup_logger()).error(variable + " : " + variable2 + " : " + str(e.args[0]))

The code above shows what each process does; inside the "get_data_" functions, more threads are created.

I am on Windows with a quad-core Core i7. Should I spawn 3 processes or 4, given that one is the main process?

Main question: one of the spawned processes takes up a huge amount of memory (5 GB), whereas the others take around 100-200 MB. Why is this happening?

I cannot post the full code here, so please don't downvote the question. Can somebody point me in the right direction as to why one process ends up taking so much memory?

You will have to make the worker processes produce some debugging output to be able to answer your question.
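One kind of debugging output that may help is the peak memory each task allocates, so the input that drives one worker to 5 GB can be identified. The following is only a sketch using the standard-library tracemalloc module; the wrapper name measured is hypothetical, and tracemalloc only sees allocations made through Python's allocator, so memory held by C extensions may not show up.

```python
import logging
import multiprocessing
import tracemalloc

logging.basicConfig(level=logging.INFO, format="%(processName)s %(message)s")
mem_log = logging.getLogger("memory")


def measured(func, item):
    """Run func(item) and log the peak Python-level memory it allocated.

    Hypothetical helper: returns (result, peak_bytes) so the caller can
    also collect the numbers instead of only reading them from the log.
    """
    tracemalloc.start()
    try:
        result = func(item)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    mem_log.info(
        "%s: item=%r peak=%.1f MB",
        multiprocessing.current_process().name,
        item,
        peak / 1e6,
    )
    return result, peak
```

With a pool, the wrapper could be passed as pool.map(functools.partial(measured, get_data_for_somthing), some_list), assuming the worker function is defined at module level so that it can be pickled. Tracing slows the workers down, so this is meant for a diagnostic run only.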

Use, e.g., the logging module to record when threads are started and finished, how many URLs they find, and how long processing each URL takes. That output will probably lead you to further questions.
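A minimal sketch of that kind of logging might look like this. The names fetch_page, timed_fetch and crawl_pages are placeholders and not the asker's functions; fetch_page stands in for the real download and returns no links here.

```python
import logging
import threading
import time

# processName and threadName make it visible which worker wrote each line
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(processName)s %(threadName)s %(message)s",
)
log = logging.getLogger("crawler")


def fetch_page(url):
    """Hypothetical stand-in for the real download; returns the links found."""
    return []


def timed_fetch(url, results):
    """Fetch one URL and log start, finish, link count and elapsed time."""
    log.info("start %s", url)
    started = time.perf_counter()
    links = fetch_page(url)
    elapsed = time.perf_counter() - started
    log.info("done %s: %d links in %.3fs", url, len(links), elapsed)
    # each thread writes its own key, so no extra locking is used here
    results[url] = (len(links), elapsed)


def crawl_pages(urls):
    """Start one thread per URL, wait for all of them, return the stats."""
    results = {}
    threads = [
        threading.Thread(target=timed_fetch, args=(url, results))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

If one process logs far more "start" lines than the others, or a URL with an unusually high link count, that is probably the place to look first.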

Maybe some pages contain links to themselves, sending your program into an infinite loop.
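If self-referencing pages do turn out to be the cause, a common guard is to remember every URL already queued and skip repeats. This is a sketch under assumptions: crawl, get_links and max_pages are hypothetical names, and get_links is whatever function returns the links found on a page.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=1000):
    """Breadth-first crawl that visits each URL at most once.

    get_links is a caller-supplied function (an assumption for this
    sketch) that takes a URL and returns the URLs it links to.
    max_pages is a hard cap, a second safety net against runaway crawls.
    """
    seen = {start_url}          # every URL that has ever been queued
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:    # skips self-links and cycles
                seen.add(link)
                queue.append(link)
    return visited
```

For example, with pages = {"a": ["a", "b"], "b": ["a"]}, calling crawl("a", lambda u: pages.get(u, [])) visits "a" and "b" once each and then stops, even though both pages link back to "a".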


 