Python: Multiprocessing in a simple loop
My objective is to get the source code of various web pages with the Selenium webdriver package. To use the idle time while pages are loading, I would like to use multiprocessing. However, as I am new to multiprocessing, I have not managed to make my code work.
This is a simple function that serves as an example of what I would like to run in parallel (the selenium webdriver package and the time package are required):
def get_source(links):
    for i in range(len(links)):
        time.sleep(3)
        driver.get(links[i])
        time.sleep(3)
        print(driver.page_source)
        time.sleep(3)
        print("Done with the page")
Different webpages are fed into this function, e.g.:
links = ["https://stackoverflow.com/questions/tagged/javascript","https://stackoverflow.com/questions/tagged/python","https://stackoverflow.com/questions/tagged/c%23","https://stackoverflow.com/questions/tagged/php"]
This is what I have so far. Unfortunately, it only spawns instances of the webdriver instead of doing what it is intended to do.
if __name__ == '__main__':
    pool = Pool(2)
    pool.map(get_source(), links)
Any help is highly appreciated! Thanks a lot!
When using multiprocessing.Pool, use the apply_async method to map a function to a list of parameters. Note that since the function is run asynchronously, you should pass some sort of index to the function and have it returned with the result. In this case, the function returns the URL along with the page source.
Try this code:
import multiprocessing as mp
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

def get_source(link):  # single URL
    time.sleep(3)
    driver.get(link)
    time.sleep(3)
    print("Done with the page:", link)
    return (link, driver.page_source)  # return tuple: link & source

links = [
    "https://stackoverflow.com/questions/tagged/javascript",
    "https://stackoverflow.com/questions/tagged/python",
    "https://stackoverflow.com/questions/tagged/c%23",
    "https://stackoverflow.com/questions/tagged/php"
]

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    results = [pool.apply_async(get_source, args=(lnk,)) for lnk in links]  # maps function to iterable
    output = [p.get() for p in results]  # collects and returns the results
    for r in output:
        print("len =", len(r[1]), "for link", r[0])  # read tuple elements
Output
Done with the page: https://stackoverflow.com/questions/tagged/python
Done with the page: https://stackoverflow.com/questions/tagged/javascript
Done with the page: https://stackoverflow.com/questions/tagged/c%23
Done with the page: https://stackoverflow.com/questions/tagged/php
len = 163045 for link https://stackoverflow.com/questions/tagged/javascript
len = 161512 for link https://stackoverflow.com/questions/tagged/python
len = 192744 for link https://stackoverflow.com/questions/tagged/c%23
len = 192678 for link https://stackoverflow.com/questions/tagged/php
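As an aside, the pool.map call in the question fails because get_source() is invoked immediately (with no arguments) and its return value is handed to map, rather than the function object itself. Passing the function without parentheses also works and preserves input order. A minimal sketch of that fix, with an illustrative stub in place of the real Selenium scraper:

```python
import multiprocessing as mp

def get_source(link):
    # Stand-in for the Selenium version: pretend to fetch the page and
    # return a (link, page_source) tuple, mirroring the answer's shape.
    return (link, "source of " + link)

links = [
    "https://stackoverflow.com/questions/tagged/javascript",
    "https://stackoverflow.com/questions/tagged/python",
]

if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        # Pass the function object itself (no parentheses);
        # map calls it once per link and keeps the input order.
        output = pool.map(get_source, links)
    for link, source in output:
        print("len =", len(source), "for link", link)
```

Unlike apply_async, pool.map blocks until all results are in and returns them in the same order as the input, so no extra index needs to be threaded through.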