
Main loop does not wait for python multiprocessing pool to finish & freeze

It's my first python project after 10 years and my first experience with python multiprocessing, so I may just be making some very basic mistakes I haven't spotted.

I'm stuck with python and a multiprocessing web crawler. My crawler checks a main page for changes and then iterates through subcategories in parallel, adding items to a list. These items are then checked in parallel and extracted via selenium (I couldn't figure out how to do it otherwise, because the content is dynamically loaded into the page when an item is clicked).

Main loop:

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import time
from bs4 import BeautifulSoup
import pickledb
import random
import multiprocessing
import itertools

import config


requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def getAllSubCategories(pageNumber, items):
    # check website and look for subcategories that are "worth" extracting
    url = 'https://www.google.com' + str(pageNumber)
    response = requests.get(url, verify=False, headers=config.headers, cookies=config.cookies)
    pageSoup = BeautifulSoup(response.content, features='html.parser')
    elements = pageSoup.find_all(...)
    if not elements: # website not loading properly, try again
        return getAllSubCategories(pageNumber, items)

    for element in elements:
        items.append(element)


def checkAndExtract(item, ignoredItems, itemsToIgnore):
    # check if items are already extracted; if not, extract them if they contain a keyword
    import checker
    import extractor

    if item not in ignoredItems:
        if checker.check(item):
            extractor.extract(item, itemsToIgnore)
        else: itemsToIgnore.append(item)


if __name__ == '__main__':
    multiprocessing.freeze_support()

    itemsToIgnore = multiprocessing.Manager().list()

    crawlUrl = 'https://www.google.com/'
    db = pickledb.load('myDB.db', False)

    while True:
        try:
            # check main website for changes
            response = requests.get(crawlUrl, verify=False, headers=config.headers, cookies=config.cookies)
            soup = BeautifulSoup(response.content, features='html.parser')
            mainCondition = soup.find(...)

            if mainCondition:
                numberOfPages = soup.find(...)

                ignoredItems = db.get('ignoredItems')
                if not ignoredItems:
                    db.lcreate('ignoredItems')
                    ignoredItems = db.get('ignoredItems')

                items = multiprocessing.Manager().list()
                # get all items from subcategories
                with multiprocessing.Pool(30) as pool:
                    pool.starmap(getAllSubCategories, zip(range(numberOfPages, 0, -1), itertools.repeat(items)))

                itemsToIgnore[:] = []
                # loop through all items
                with multiprocessing.Pool(30) as pool:
                    pool.starmap(checkAndExtract, zip(items, itertools.repeat(ignoredItems), itertools.repeat(itemsToIgnore)))

                for item in itemsToIgnore:
                    if item not in db.get('ignoredItems'): db.ladd('ignoredItems', item)
                db.dump()

            time.sleep(random.randint(10, 20))
        except KeyboardInterrupt:
            break
        except Exception as e:
            print(e)
            continue

Checker:

import config

def check(item):
    title = item...
    try:
        for keyword in config.keywords: # just a string array
            if keyword.lower() in title.lower():
                return True
    except Exception as e:
        print(e)

    return False

Extractor:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time

import config

def extract(item, itemsToIgnore):
    driver = webdriver.Chrome('./chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://www.google.com')
    for key in config.cookies:
        driver.add_cookie({'name': key, 'value': config.cookies[key], 'domain': '.google.com'})
    try:
        driver.get('https://www.google.com')

        wait = WebDriverWait(driver, 10)
        if driver.title == 'Page Not Found':
            extract(item, itemsToIgnore)
            return

        driver.find_element_by_xpath('...').click()
        time.sleep(1)
        button = wait.until(EC.element_to_be_clickable((By.XPATH, '...')))
        button.click()
        # and some extraction magic
    except:
        extract(item, itemsToIgnore) # try again

Everything is working fine and some test runs were successful. But sometimes the loop starts again before the pool has finished its work. In the logs I can see the item checker return true, but the extractor does not even start, and the main process begins the next iteration:

2019-12-23 00:21:16,614 [SpawnPoolWorker-6220] [INFO ] check returns true
2019-12-23 00:21:18,142 [MainProcess         ] [DEBUG] starting next iteration
2019-12-23 00:21:39,630 [SpawnPoolWorker-6247] [INFO ] checking subcategory

I also suspect that the pool is somehow not cleaning up after itself, as I doubt the SpawnPoolWorker-XXXX numbers should be that high. It also freezes after ~1 hour. This may be connected to this issue.

I fixed the loop issue by either switching from Win7 to Win10 or by switching from starmap to starmap_async and calling get() on the result afterwards.
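
For illustration, a minimal sketch of what the starmap_async variant could look like for the subcategory step, reusing numberOfPages, items and the pool setup from the question (selectors still elided):

with multiprocessing.Pool(30) as pool:
    # starmap_async returns an AsyncResult immediately instead of a list of results;
    # get() blocks the main process until every worker has finished (and re-raises
    # any exception from the workers), so the while loop cannot start its next
    # iteration while jobs are still running
    result = pool.starmap_async(getAllSubCategories, zip(range(numberOfPages, 0, -1), itertools.repeat(items)))
    result.get()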

The freeze was most probably caused by calling requests.get() without passing a value for timeout.
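
For reference, a sketch of the same call with a timeout added; the 10-second value is an arbitrary choice, not from the original:

# without a timeout, requests can wait indefinitely on an unresponsive server;
# with one, it raises requests.exceptions.Timeout after 10 seconds, which the
# existing `except Exception` in the main loop would catch
response = requests.get(crawlUrl, verify=False, headers=config.headers,
                        cookies=config.cookies, timeout=10)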

You may try this for your pool jobs:

poolJob1 = pool.starmap_async(getAllSubCategories, zip(range(numberOfPages, 0, -1), itertools.repeat(items)))
pool.close()  # no new tasks can be submitted to the pool after this point
pool.join()   # block until all worker processes have finished
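
Note that close() and join() are methods of the Pool itself, and close() must be called before join(). The with-statement used in the question terminates the pool on exit rather than waiting for it, so an explicit close()/join() (or a get() on the AsyncResult) is what actually blocks until the workers are done.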
