Scraping JavaScript with Python and Selenium Webdriver

I'm trying to scrape the ads from Ask, which are generated inside an iframe by JavaScript hosted by Google.

When I navigate to the page manually and view the source, the ads are there (I'm specifically looking for a div with the id "adBlock", which is inside an iframe).

But when I drive Firefox, ChromeDriver, or FirefoxPortable through Selenium, the source returned to me is missing all of the elements I'm looking for.

I tried scraping with urllib2 and got the same result, even after adding the necessary headers. I was sure that a real browser instance, like the one WebDriver creates, would fix the problem.

Here's the code I'm working from, cobbled together from a few different sources:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint

# Create a new instance of the Chrome driver (raw string keeps the backslashes literal)
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")

print(driver.title)
inputElement = driver.find_element_by_name("q")

# type in the search
inputElement.send_keys("baseball hats")
# submit the form (although google automatically searches now without submitting)
inputElement.submit()

try:
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print(driver.title)
    output = driver.page_source
    print(output)
finally:
    driver.quit()

I know the code cycles through a few different attempts at viewing the source; that's not what I'm concerned about.

Any thoughts on why I get one result from this script (ads omitted) and a totally different result (ads present) from the browser it opens? I've tried Scrapy, Selenium, urllib2, etc. No joy.

Selenium only returns the contents of the current frame or iframe, so page_source on the top-level document does not include what is rendered inside the ad iframe. You'll have to switch into each iframe, using something along these lines:

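# Collect every iframe element in the top-level document
iframes = driver.find_elements_by_tag_name("iframe")

for iframe in iframes:
    # Return to the top-level document, then enter this iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)

    # page_source now returns the iframe's own document
    output = driver.page_source
    print(output)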
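
The loop above uses the older Selenium method names (find_elements_by_tag_name, switch_to_frame), which newer Selenium releases have dropped in favor of find_elements(by, value) and the switch_to object. As a rough sketch of the same idea in the newer style, the helper below switches into each top-level iframe and keeps only the frames that contain a given element id. The function name find_in_frames is hypothetical, and the sketch assumes the target div sits in a direct child iframe of the page, not in a nested one.

```python
def find_in_frames(driver, element_id):
    """Return (index, page_source) for each top-level iframe containing element_id.

    `driver` is assumed to be a Selenium WebDriver instance. The locator
    strings "tag name" and "id" are the values behind By.TAG_NAME and By.ID,
    so this sketch does not need to import selenium itself.
    """
    matches = []
    frames = driver.find_elements("tag name", "iframe")
    for index, frame in enumerate(frames):
        # Always start from the top-level document before entering a frame.
        driver.switch_to.default_content()
        driver.switch_to.frame(frame)
        # Inside the frame, lookups and page_source refer to the frame's document.
        if driver.find_elements("id", element_id):
            matches.append((index, driver.page_source))
    # Leave the driver pointing at the top-level document again.
    driver.switch_to.default_content()
    return matches
```

With the script from the question, this would be called as find_in_frames(driver, "adBlock") after the search results load. Because the ad iframe is filled in by JavaScript, it may not exist yet when the title changes; waiting with WebDriverWait and EC.frame_to_be_available_and_switch_to_it before reading the source is one way to handle that.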
