
How to use headless Chrome WebDriver with a .pac file proxy in Python?

Here's the situation: I have a .pac URL as my proxy. On Ubuntu, the proxy works when the network proxy is set to automatic mode with the .pac URL filled in as the Configuration URL. When I use Python to crawl Google Images, plain requests to Google don't work, so I use Selenium's Chrome WebDriver to simulate a user's mouse and keyboard actions, and that works. Then I added the '--headless' argument to increase concurrency, and I got a TimeoutException.

Then I downloaded the .pac file and tried "options.add_argument('--proxy-pac-url=xxx.pac')" to solve the problem, but the proxy still doesn't work.

I then found a suggested solution: use a Chrome extension called 'SwitchyOmega' to apply the .pac file proxy. But when I download the latest release from GitHub and load it with "options.add_extension('xxx/SwitchyOmega_Chromium.crx')", I get: "from unknown error: CRX verification failed: 3".

Finally, I configured SwitchyOmega in Chrome, packed the local extension files into a .crx with the developer tools, and that extension loaded correctly in the WebDriver. But I found the freshly loaded extension is unconfigured.

So how can I fix this proxy problem? Thanks!

Here is my code:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import element_to_be_clickable


class GoogleCrawler:

    def __init__(self):
        driver_executable = self.get_driver_executable()
        options = webdriver.ChromeOptions()
        # Disable image loading to speed up page rendering.
        options.add_argument('blink-settings=imagesEnabled=false')
        # options.add_argument('--headless')
        # options.add_argument('--proxy-pac-url=./xxx.pac')
        # options.add_extension('./SwitchyOmega_Chromium.crx')
        self.browser = webdriver.Chrome(driver_executable,
                                        chrome_options=options)
        self.driver_version_check()

    def get_google_image_urls(self, keyword):
        self.browser.get(f'https://www.google.com/search?q={keyword}&tbm=isch')
        time.sleep(2)

        img_urls = []
        first_thumbnail_image_xpath = '//div[@data-ri="0"]'
        image_xpath = '//div[@class="irc_c i8187 immersive-container"]//img[@class="irc_mi"]'
        body_element = self.browser.find_element_by_tag_name('body')

        wait = WebDriverWait(self.browser, 15)
        first_thumbnail_image = wait.until(
            element_to_be_clickable((By.XPATH, first_thumbnail_image_xpath)))
        first_thumbnail_image.click()

        # Step through the image carousel with the RIGHT arrow key; stop once
        # the scroll position has stayed unchanged for 50 consecutive presses.
        scroll_flag = 0
        last_scroll_distance = 0
        while scroll_flag <= 50:
            image_elements = self.browser.find_elements(By.XPATH, image_xpath)
            img_urls.extend([
                image_element.get_attribute('src')
                for image_element in image_elements
            ])

            body_element.send_keys(Keys.RIGHT)

            scroll_distance = self.browser.execute_script(
                'return window.pageYOffset;')
            if scroll_distance == last_scroll_distance:
                scroll_flag += 1
            else:
                last_scroll_distance = scroll_distance
                scroll_flag = 0

        self.browser.close()
        img_urls = set(img_urls)
        print(
            f'[INFO] Scraping image URLs DONE: keyword: {keyword}, total: {len(img_urls)}'
        )
        return keyword, img_urls

Since headless Chrome supports neither PAC files nor Chrome extensions, I don't think there is a way to make this work with PAC files directly.

Can you run your own proxy, put the routing logic in that proxy, and pass its address to Chrome's --proxy-server flag?
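If the PAC script always returns one fixed upstream proxy, you can also sidestep PAC support entirely: extract that proxy from the file and hand it to --proxy-server. A minimal sketch, assuming a single static "PROXY host:port" directive (full PAC evaluation would require a JavaScript engine, since PAC files are JavaScript):

```python
import re


def proxy_from_pac(pac_text):
    """Pull the first 'PROXY host:port' directive out of a PAC script.

    Naive by design: it assumes the PAC file always routes through one
    fixed proxy rather than choosing a proxy per URL.
    """
    match = re.search(r'PROXY\s+([\w.\-]+:\d+)', pac_text)
    return match.group(1) if match else None


# Hypothetical PAC content, for illustration only.
pac = 'function FindProxyForURL(url, host) { return "PROXY 10.0.0.5:8080"; }'
proxy = proxy_from_pac(pac)
print(f'--proxy-server=http://{proxy}')  # → --proxy-server=http://10.0.0.5:8080
```

The resulting string can then be passed via options.add_argument(...), which headless Chrome does honor.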
