Distributing a python workload across multiple processes

Let us suppose that I want to do a google search for the word 'hello'. I then want to go to every single link on the first 100 pages of Google and download the HTML of that linked page. Since there are 10 results per page, this would mean I'd have to click about 1,000 links.

This is how I would do it with a single process:

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://google.com')

# do the search
search = driver.find_element_by_name('q')
search.send_keys('hello')
search.submit()

# click all the items
links_on_page = driver.find_elements_by_xpath('//li/div/h3/a')
for item in links_on_page:
    item.click()
    # do something on the page
    driver.back()

# go to the next page
driver.find_element_by_xpath('//*[@id="pnnext"]').click()

This would obviously take a very long time to do on 100 pages. How would I distribute the load, such that I could have (for example) three drivers open, and each would 'check out' a page. For example:

  • Driver #1 checks out page 1. Starts page 1.
  • Driver #2 sees that page 1 is checked out and goes to page #2. Starts page 2.
  • Driver #3 sees that page 1 is checked out and goes to page #2. Same with page 2. Starts page 3.
  • Driver #1 finishes work on page 1...starts page 4.

I understand the principle of how this would work, but what would be the actual code to get a basic implementation of this working?

You probably want to use a multiprocessing Pool. To do so, write a method that is parameterised by page number:

def get_page_data(page_number):
    # Fetch page data
    ...
    # Parse page data
    ...
    for linked_page in parsed_links:
        # Fetch page source and save to file
        ...
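
For concreteness, here is one way that skeleton could be filled in, staying with Selenium as in the question (a sketch only: the Google URL parameters and the XPath are assumptions carried over from the question, and each call opens and closes its own browser):

from selenium import webdriver

def get_page_data(page_number):
    # Each call gets its own browser; simple, but see the refinement further down
    driver = webdriver.Firefox()
    try:
        # Google's 'start' parameter skips ahead 10 results per page
        driver.get('http://google.com/search?q=hello&start=%d'
                   % ((page_number - 1) * 10))

        # Collect the result URLs up front so we can navigate away freely
        links = [a.get_attribute('href')
                 for a in driver.find_elements_by_xpath('//li/div/h3/a')]

        for i, url in enumerate(links):
            driver.get(url)
            # Save the linked page's HTML to a file
            with open('page%d_link%d.html' % (page_number, i), 'w') as f:
                f.write(driver.page_source)
    finally:
        driver.quit()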

Then just use a Pool of however many processes you think is appropriate (determining this number will probably require some experimentation):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(get_page_data, range(1,101))

This will now set 4 processes going, each fetching a page from Google and then fetching each of the pages it links to.
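
The Pool.map call also gives you the 'check out' behaviour described in the question: whichever worker process is free picks up the next batch of unclaimed page numbers. If starting a fresh browser for every page number turns out to be too slow, one possible refinement (a sketch, not part of the answer above) is to create a single WebDriver per worker process via the Pool's initializer and reuse it across calls:

from multiprocessing import Pool
from selenium import webdriver

driver = None  # each worker process gets its own WebDriver

def init_driver():
    # Runs once in every worker process when the Pool starts it
    global driver
    driver = webdriver.Firefox()

def get_page_data(page_number):
    # Reuses this worker's driver instead of opening a new browser per call;
    # the body is otherwise the same as the earlier sketch
    driver.get('http://google.com/search?q=hello&start=%d' % ((page_number - 1) * 10))
    ...

if __name__ == '__main__':
    # Note: the per-worker drivers are never quit explicitly here; a real
    # version would need to clean them up when the pool shuts down
    pool = Pool(processes=4, initializer=init_driver)
    pool.map(get_page_data, range(1, 101))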

Not answering your question directly, but proposing an avenue that might make your code usable in a single process, thereby saving you synchronisation issues between different threads/processes...

You would probably be better off with a framework such as Twisted, which enables asynchronous network operations, in order to keep all operations within the same process. In your code, the parsing of the HTML is likely to take far less time than the network operations required to fetch the pages. Therefore, using asynchronous IO, you can launch several requests at the same time and parse each result only as its response arrives. In effect, each time a page is returned, your process is likely to be "idling" in the runloop anyway.
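
To make that concrete, here is a minimal sketch of the single-process, asynchronous idea using Twisted's Agent HTTP client (the example URLs and the handle_body callback are placeholders; in a real version, parsing each results page would itself schedule further requests for the linked pages):

from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def handle_body(body, url):
    # Parsing runs here while other requests are still in flight
    print(url, len(body))

def fetch_page(url):
    # Returns immediately with a Deferred; the reactor handles the network I/O
    d = agent.request(b'GET', url.encode('ascii'))
    d.addCallback(readBody)           # read the whole response body
    d.addCallback(handle_body, url)   # parse/save it once it has arrived
    return d

urls = ['http://example.com/page%d' % i for i in range(1, 4)]
done = defer.gatherResults([fetch_page(u) for u in urls])
done.addBoth(lambda _: reactor.stop())  # stop the event loop when everything finishes
reactor.run()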
