
Python Scraping JavaScript using Selenium and Beautiful Soup

I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium. This is the code I have so far. It somehow still doesn't pick up the JavaScript-generated content (and returns an empty result). In this case I'm trying to scrape the Facebook comments at the bottom of the page. (Inspect Element shows the class as postText.)
Thanks for the help!

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup.BeautifulSoup(html_source)  
comments = soup("div", {"class":"postText"})  
print comments

There are some mistakes in your code, which are fixed below. However, the class "postText" must come from somewhere else, since it does not appear in the page's original source. My revised version of your code has been tested and works on multiple websites.

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
# The class "postText" does not appear in the page source, so this may return an empty list
comments = soup.find_all('div', {'class': 'postText'})
print(comments)
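
One possible reason the element is missing is that `page_source` is read before the page's JavaScript has finished inserting it. The sketch below is only an illustration of that idea, not a tested fix for this particular page: it waits for a `div` with the target class using Selenium's `WebDriverWait` before grabbing the HTML, then parses it with Beautiful Soup. The helper names `fetch_rendered_html` and `extract_comments` and the 15-second timeout are my own assumptions. If the comments are rendered inside an iframe (which is how the Facebook comments widget is usually embedded), the wait will time out and you would need to switch into that frame first.

```python
from bs4 import BeautifulSoup


def extract_comments(html_source, css_class="postText"):
    """Return the stripped text of every <div> that carries css_class."""
    soup = BeautifulSoup(html_source, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_=css_class)]


def fetch_rendered_html(url, css_class="postText", timeout=15):
    """Load url in Firefox and wait for the target <div> before reading the DOM.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # Imported here so extract_comments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        try:
            # Block until JavaScript has added at least one matching element.
            WebDriverWait(browser, timeout).until(
                EC.presence_of_element_located(
                    (By.CSS_SELECTOR, "div." + css_class)
                )
            )
        except TimeoutException:
            # The element never showed up in the top-level document; it may
            # live inside an iframe or use a different class name.
            pass
        return browser.page_source
    finally:
        # Always close the browser, even if loading the page failed.
        browser.quit()
```

With this in place, `extract_comments(fetch_rendered_html('http://techcrunch.com/2012/05/15/facebook-lightbox/'))` would return a list of comment strings, or an empty list if no `div` with that class ever appears in the top-level document.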
