
Python: Multiprocessing in a simple loop

My objective is to get the source code of various web pages with the Selenium webdriver package. To make use of the idle time while pages are loading, I would love to use multiprocessing. However, as I am new to multiprocessing, I cannot get my code to work.

This is a simple function that serves as an example of what I would love to run in parallel (the selenium webdriver and time packages are required):

def get_source(links):
    for link in links:
        time.sleep(3)
        driver.get(link)           # open the page in the webdriver
        time.sleep(3)
        print(driver.page_source)  # print the page's HTML source
        time.sleep(3)
        print("Done with the page")

Different webpages are fed into this function, e.g.:

links = [
    "https://stackoverflow.com/questions/tagged/javascript",
    "https://stackoverflow.com/questions/tagged/python",
    "https://stackoverflow.com/questions/tagged/c%23",
    "https://stackoverflow.com/questions/tagged/php"
]

This is what I have so far. However, it only spams instances of the webdriver instead of doing what it is intended to do, unfortunately.

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(get_source(), links)

Any help is highly appreciated! Thanks a lot!

When using multiprocessing.Pool, you can use the apply_async method to map a function over a list of parameters. Note that since the function runs asynchronously, the pages can finish in any order, so you should pass some sort of index to the function and have it returned with the result so that each result can be matched to its input. In this case, the function returns the URL along with the page source.

Try this code:

import multiprocessing as mp
import time

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# create the webdriver at import time so it is available in the worker processes
driver = webdriver.Chrome(ChromeDriverManager().install())

def get_source(link):  # single URL
    time.sleep(3)
    driver.get(link)
    time.sleep(3)
    print("Done with the page:", link)
    return (link, driver.page_source)  # return tuple: link & source

links = [
    "https://stackoverflow.com/questions/tagged/javascript",
    "https://stackoverflow.com/questions/tagged/python",
    "https://stackoverflow.com/questions/tagged/c%23",
    "https://stackoverflow.com/questions/tagged/php"
]

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    results = [pool.apply_async(get_source, args=(lnk,)) for lnk in links]  # map the function over the URLs
    output = [p.get() for p in results]  # collect the results
    for r in output:
        print("len =", len(r[1]), "for link", r[0])  # read tuple elements

Output

Done with the page: https://stackoverflow.com/questions/tagged/python
Done with the page: https://stackoverflow.com/questions/tagged/javascript
Done with the page: https://stackoverflow.com/questions/tagged/c%23
Done with the page: https://stackoverflow.com/questions/tagged/php
len = 163045 for link https://stackoverflow.com/questions/tagged/javascript
len = 161512 for link https://stackoverflow.com/questions/tagged/python
len = 192744 for link https://stackoverflow.com/questions/tagged/c%23
len = 192678 for link https://stackoverflow.com/questions/tagged/php
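
As a side note, your original pool.map approach also works once you pass the function itself (get_source, not get_source()) and let the function take a single link. This is a minimal sketch of that variant, reusing the same get_source, links and module-level driver as above:

if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        # map feeds one link at a time to get_source and blocks until all results are in
        output = pool.map(get_source, links)
    for link, source in output:
        print("len =", len(source), "for link", link)

pool.map returns the results in the same order as links, whereas apply_async gives you AsyncResult handles immediately and lets you collect the results yourself.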
