
Scraping a site using Selenium and BeautifulSoup

So I'm trying to scrape a site that loads some of its content dynamically with JS. My goal is to build a quick Python script that loads a site, checks whether a certain word is there, and emails me if it is.

I'm relatively new to coding, so if there's a better way, I'd be happy to hear it.

I'm currently loading the page with Selenium and then scraping the generated page with BeautifulSoup, and that's where I'm having the issue. How do I get BeautifulSoup to scrape the site I just opened in Selenium?

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import urllib, urllib2
import time


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

site = browser.get(url)

html = urllib.urlopen(site).read()
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

I get an error that says:

Traceback (most recent call last):
  File "probation color.py", line 16, in <module>
    html = urllib.urlopen(site).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

which I don't really understand, nor do I understand why it's happening. Is it something internal to urllib? How do I fix it? I think solving that will fix my problem.

The error happens because browser.get(url) returns None, so urllib.urlopen is being handed None instead of a URL. You don't need urllib here at all: the rendered HTML can be read from the "page_source" attribute on the browser. This should work:

browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
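
One thing to keep in mind for a page that renders content with JS: page_source returns whatever the browser has rendered at that moment, so if the word you're looking for is injected with a delay, you may need to wait for it first. Here is a minimal sketch using Selenium's WebDriverWait; the element id "content" is just a placeholder assumption for whatever element the page's script actually adds:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.get(url)

# Wait up to 10 seconds for the dynamically loaded element to appear.
# "content" is a placeholder id; use whatever the JS actually renders.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)

html = browser.page_source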
Here is the full script with that fix applied:

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

browser.get(url)  # get() returns None, so there is nothing useful to assign here
html = browser.page_source  # the rendered HTML lives on the browser object

# html = urllib.urlopen(site).read()  # this was the mistake
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
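
Once you have the soup, the rest of the stated goal (check for a certain word and email yourself) can be sketched roughly as below with the standard-library smtplib. The word, addresses, credentials, and SMTP server here are placeholders I've assumed, not anything from the question:

import smtplib
from email.mime.text import MIMEText

word = "some word"  # placeholder for the word you're checking for
if word in soup.get_text():
    msg = MIMEText("Found '%s' on %s" % (word, url))
    msg["Subject"] = "Word found"
    msg["From"] = "me@example.com"   # placeholder sender
    msg["To"] = "me@example.com"     # placeholder recipient

    server = smtplib.SMTP("smtp.example.com", 587)  # placeholder SMTP server
    server.starttls()
    server.login("me@example.com", "password")      # placeholder credentials
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()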
