PhantomJS returning empty web page (python, Selenium)

I'm trying to screen scrape a web site from a Python script (using Selenium) without having to launch an actual browser instance. I can do this with Chrome or Firefox (I've tried it and it works), but I want to use PhantomJS so that it's headless.

The code looks like this:

import sys
import traceback
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)

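except Exception:
    # Capture the page state and the stack trace for debugging
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)

finally:
    # quit() also shuts down the PhantomJS process; close() only closes the window
    browser.quit()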

The output is merely:

<html><head></head><body></body></html>

But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that. No difference.

What am I missing?

UPDATED: I will try to keep the snippet below updated until it works. What's below is what I'm currently trying.

import sys
import traceback
import time
import re

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print("getting web page...")
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print("waiting %s seconds..." % timeout)
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print("done waiting. Response:")

    # Rest of code snipped. Fails as "wait" above.

I was facing the same problem, and no amount of code to make the driver wait was helping.
The problem is SSL errors on HTTPS websites; telling PhantomJS to ignore them does the trick.

Call the PhantomJS driver as:

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

This solved the problem for me.
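
A minimal sketch of how these flags could be assembled in one place, so they are easy to toggle while debugging. The helper names `phantomjs_service_args` and `make_driver` are hypothetical and not part of Selenium; only the two flags from this answer are used, and `webdriver.PhantomJS` is assumed to be available, which is only true for older Selenium releases.

```python
def phantomjs_service_args(ignore_ssl_errors=True, ssl_protocol="TLSv1"):
    """Build the command-line flags passed to PhantomJS via service_args.

    Hypothetical helper for illustration; not part of Selenium.
    """
    args = []
    if ignore_ssl_errors:
        args.append("--ignore-ssl-errors=true")
    if ssl_protocol:
        args.append("--ssl-protocol=%s" % ssl_protocol)
    return args


def make_driver():
    # Imported lazily so the helper above works without Selenium installed.
    # Assumption: an older Selenium release that still ships webdriver.PhantomJS.
    from selenium import webdriver
    return webdriver.PhantomJS(service_args=phantomjs_service_args())
```

Calling `phantomjs_service_args(ssl_protocol=None)` would keep only the SSL-error flag, which makes it simple to test which of the two flags actually fixes the blank page.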

You need to wait for the page to load. Usually, this is done by using an Explicit Wait to wait for a key element to be present or visible on the page. For instance:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


# ...
browser.get("https://www.whatever.com")

wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))

html_source = browser.page_source
# ...

Here, we'll wait up to 10 seconds for a div element with class="content" to become visible before getting the page source.
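
To make the idea concrete, here is a rough sketch of what an explicit wait does: it polls a condition until it holds or a timeout expires. This is not Selenium's implementation, just an illustration in plain Python 3; the function name `wait_until` and its parameters are made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Illustrative only; Selenium's WebDriverWait differs in detail.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)
```

With a driver in hand, the condition could be something like `lambda: "content" in browser.page_source`, though the expected-conditions approach shown above is usually the better choice because it checks the live DOM.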


Additionally, you may need to ignore SSL errors:

browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])

Though, I'm pretty sure this is related to the redirecting issues in PhantomJS. There is an open ticket in the PhantomJS bug tracker.

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

This worked for me.
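
When trying the fixes above, it can help to check programmatically whether the saved page source is still the empty skeleton from the question (`<html><head></head><body></body></html>`). The following is a small sketch using only the Python 3 standard library; `is_blank_page` is a hypothetical helper name, and treating "no tags and no text inside body" as blank is an assumption of this example.

```python
from html.parser import HTMLParser


class _BodyContentProbe(HTMLParser):
    """Records whether <body> contains any tag or non-whitespace text."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.has_content = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.has_content = True


def is_blank_page(html_source):
    """Return True if the document has no body, or an empty one."""
    probe = _BodyContentProbe()
    probe.feed(html_source)
    probe.close()
    return not probe.has_content
```

For example, `is_blank_page(browser.page_source)` could be checked right after `browser.get(...)` to decide whether to retry with different `service_args`.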
