Scraping a website with Selenium and BeautifulSoup

Question

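So I'm trying to scrape a website that loads its content dynamically with JS. My goal is to build a quick Python script that loads the site, checks whether a certain word is present, and then emails me.

I'm new to coding, so if there's a better way, I'd be happy to hear it.

I'm currently using Selenium to load the page and then BeautifulSoup to scrape the generated page, and that's where I'm running into trouble. How do I get BeautifulSoup to scrape the page I just opened in Selenium?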

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import urllib, urllib2
import time


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

site = browser.get(url)

html = urllib.urlopen(site).read()
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

I get an error that says:

Traceback (most recent call last):
  File "probation color.py", line 16, in <module>
    html = urllib.urlopen(site).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

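I don't really follow this, and I don't understand why it happens. Is it something internal to urllib? How do I fix it? I think fixing this would solve my problem.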

Answer 1

The HTML is available through the "page_source" attribute of the browser object. This should work:

browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

Answer 2

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
#import urllib, urllib2
import time


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

browser.get(url)  # get() returns None, so there is nothing useful to assign
html = browser.page_source  # read the rendered HTML from the browser object instead

#html = urllib.urlopen(site).read()  # the original mistake: site was None here
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
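
To cover the rest of the goal in the question (check for a word, then send an email), here is a rough sketch using only BeautifulSoup and the standard library. It assumes Python 3, unlike the Python 2.7 shown in the traceback. The function names, the addresses, and the SMTP host and port are all placeholders, not anything from the original post. It uses the built-in "html.parser" so that lxml is not required.

```python
import smtplib
from email.message import EmailMessage

from bs4 import BeautifulSoup


def page_contains_word(html, word):
    """Case-insensitive substring check against the page's text."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    # Note: this is a plain substring match, so "cat" also matches "category".
    return word.lower() in text.lower()


def build_alert(word, url, sender, recipient):
    """Build (but do not send) a notification email."""
    msg = EmailMessage()
    msg["Subject"] = "'{}' found on {}".format(word, url)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The word '{}' appeared on {}.".format(word, url))
    return msg


def send_alert(msg, host, port, user, password):
    """Send the message over SMTP with SSL; host and port depend on your provider."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)
```

With html taken from browser.page_source, you would call page_contains_word(html, "someword") and, if it returns True, pass the result of build_alert(...) to send_alert(...) along with your own mail server details.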

Answer 1: score 2, accepted, 2015-12-10 21:22:19

Answer 2: score 1, 2016-10-11 09:46:45
