Python: Multiprocessing in a simple loop

My objective is to get the source code of various web pages with the Selenium driver package. To make use of the idle time while pages load, I would like to use multiprocessing. However, as I am new to multiprocessing, I have not managed to get my code to work.

This is a simple function that serves as an example of what I would like to run in parallel (the selenium webdriver package and the time module are required):

def get_source(links):
    # assumes `driver` is an already created selenium webdriver instance
    for link in links:
        time.sleep(3)
        driver.get(link)
        time.sleep(3)
        print(driver.page_source)
        time.sleep(3)
        print("Done with the page")

Different web pages are fed into this function, e.g.:

links = ["https://stackoverflow.com/questions/tagged/javascript","https://stackoverflow.com/questions/tagged/python","https://stackoverflow.com/questions/tagged/c%23","https://stackoverflow.com/questions/tagged/php"]

This is what I have so far. Unfortunately, it only keeps launching new webdriver instances instead of doing what it is intended to do.

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(get_source(), links)

Any help is highly appreciated! Thanks a lot!

When using multiprocessing.Pool, you can call the apply_async method once per parameter to apply a function to every item in a list. Because the calls run asynchronously, it helps to pass some sort of identifier to the function and have it returned together with the result. In this case, the function returns the URL along with the page source.
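As a minimal sketch of the pattern without Selenium, the example below uses a hypothetical stand-in function, page_length, and placeholder strings instead of real URLs (both are assumptions made for illustration). It also shows that pool.map expects the function object itself: writing get_source() with parentheses calls the function immediately in the parent process rather than handing it to the workers.

```python
import multiprocessing as mp


def page_length(link):
    """Stand-in for a real page fetch: returns the link and its length."""
    return link, len(link)


def run_both(links, processes=2):
    """Run page_length over links with pool.map and with apply_async."""
    with mp.Pool(processes=processes) as pool:
        # Pass the function object itself; page_length() would call it
        # right here in the parent process instead of in the workers.
        mapped = pool.map(page_length, links)

        # apply_async submits one call at a time and returns immediately
        # with an AsyncResult; get() blocks until that call has finished.
        pending = [pool.apply_async(page_length, args=(lnk,)) for lnk in links]
        collected = [p.get() for p in pending]
    return mapped, collected


if __name__ == "__main__":
    # Placeholder inputs, not real URLs
    mapped, collected = run_both(["a", "bb", "ccc"])
    print(mapped)
    print(collected)
```

Both lists come back in the order of the input: pool.map preserves it, and the apply_async results are read in the order in which they were submitted, even if the workers finish in a different order.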

Try this code:

import multiprocessing as mp
import time
from selenium import webdriver

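from webdriver_manager.chrome import ChromeDriverManager
# Older Selenium releases accept the driver path positionally, as here;
# newer Selenium 4 releases expect it wrapped in a Service object instead.
driver = webdriver.Chrome(ChromeDriverManager().install())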

def get_source(link):  # single URL
    time.sleep(3)
    driver.get(link)
    time.sleep(3)
    print("Done with the page:", link)
    return (link, driver.page_source)   # return tuple: link & source
        
links = [
    "https://stackoverflow.com/questions/tagged/javascript",
    "https://stackoverflow.com/questions/tagged/python",
    "https://stackoverflow.com/questions/tagged/c%23",
    "https://stackoverflow.com/questions/tagged/php",
]

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    # submit one asynchronous call per link
    results = [pool.apply_async(get_source, args=(lnk,)) for lnk in links]
    output = [p.get() for p in results]   # wait for and collect each result
    for r in output:
        print("len =", len(r[1]), "for link", r[0])   # read tuple elements
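One thing to be aware of: the driver above is created at module level, so with the spawn start method (the default on Windows and macOS) every worker re-imports the module and opens a browser of its own, in addition to the one in the parent. A possible alternative, sketched here under assumptions, is to create exactly one driver per worker through the initializer argument of Pool. The names make_chrome_driver, init_worker and fetch_all are hypothetical, and make_chrome_driver assumes a recent Selenium release that can locate chromedriver by itself.

```python
import multiprocessing as mp

_driver = None  # one browser per worker process, set by init_worker


def make_chrome_driver():
    """Hypothetical factory; assumes a recent Selenium that finds chromedriver itself."""
    from selenium import webdriver  # imported lazily, only inside the workers
    return webdriver.Chrome()


def init_worker(factory=make_chrome_driver):
    """Runs once in each worker process and stores that worker's driver."""
    global _driver
    _driver = factory()


def get_source(link):
    """Load a single URL with this worker's driver; return (link, page source)."""
    _driver.get(link)
    return link, _driver.page_source


def fetch_all(links, processes=2):
    """Fetch all links using `processes` workers, each with its own driver."""
    with mp.Pool(processes=processes, initializer=init_worker) as pool:
        return pool.map(get_source, links)
```

fetch_all(links) should be called from under the if __name__ == '__main__': guard, as in the code above. The sketch does not close the browsers; a real script would also need to arrange for driver.quit() to run in each worker.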

Output

Done with the page: https://stackoverflow.com/questions/tagged/python
Done with the page: https://stackoverflow.com/questions/tagged/javascript
Done with the page: https://stackoverflow.com/questions/tagged/c%23
Done with the page: https://stackoverflow.com/questions/tagged/php
len = 163045 for link https://stackoverflow.com/questions/tagged/javascript
len = 161512 for link https://stackoverflow.com/questions/tagged/python
len = 192744 for link https://stackoverflow.com/questions/tagged/c%23
len = 192678 for link https://stackoverflow.com/questions/tagged/php
