Scraping a site using Selenium and BeautifulSoup

Question

I'm trying to scrape a site that loads some of its content dynamically with JS. My goal is to build a quick Python script that loads the site, checks whether a certain word is present, and emails me if it is.

I'm relatively new to coding, so if there's a better way, I'd be happy to hear.

I'm currently loading the page with Selenium and then trying to scrape the generated page with BeautifulSoup, and that's where I'm having the issue. How do I get BeautifulSoup to scrape the page I just opened in Selenium?

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import urllib, urllib2
import time


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

site = browser.get(url)

html = urllib.urlopen(site).read()
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

I have an error that says

Traceback (most recent call last):
  File "probation color.py", line 16, in <module>
    html = urllib.urlopen(site).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

which I don't really understand, and I can't see why it's happening. Is it something internal to urllib? How do I fix it? I think solving that will fix my problem.

Answer 1

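The rendered HTML is available from the "page_source" attribute on the browser. Note that browser.get(url) returns None, so passing its result to urllib.urlopen is what raises the AttributeError. This should work: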

browser = webdriver.Chrome(executable_path = path_to_chromedriver)
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
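
Since the goal is to check for a word on a page that fills in its content with JS, page_source may be read before that content has arrived. Below is a minimal sketch of one way to handle that, written for Python 3. The helper names contains_word and wait_for_word, the timeout values, and the plain polling loop are my own assumptions rather than part of Selenium or BeautifulSoup; the loop only relies on the browser object having a page_source attribute.

```python
import time

from bs4 import BeautifulSoup


def contains_word(html, word):
    """Return True if `word` occurs in the page's text (case-insensitive).

    This is a plain substring check, so "word" also matches "password".
    """
    # "html.parser" ships with Python; "lxml" works too if it is installed.
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return word.lower() in text.lower()


def wait_for_word(browser, word, timeout=10.0, poll=0.5):
    """Re-read browser.page_source until `word` appears or `timeout` seconds pass.

    Hypothetical helper: `browser` can be any object with a `page_source`
    attribute, such as a Selenium WebDriver after browser.get(url).
    """
    deadline = time.monotonic() + timeout
    while True:
        if contains_word(browser.page_source, word):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

After browser.get(url), something like wait_for_word(browser, "some word") would return True as soon as the word shows up in the rendered page. Selenium also ships its own WebDriverWait class for waiting on conditions, which may be a better fit if you are waiting for a specific element.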

Answer 2

from __future__ import print_function
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
#import urllib, urllib2
import time


url = 'http://www.somesite.com/'

path_to_chromedriver = '/Users/admin/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

browser.get(url) #get() returns None, so don't assign its result
html = browser.page_source #you should have used this...

#html = urllib.urlopen(site).read() #this was the mistake
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
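
For the "email me if it's there" part of the question, here is a rough sketch using only the standard library (Python 3). The function names build_alert and send_alert are made up for illustration, and the SMTP host, port, and addresses are placeholders; it assumes a mail server that accepts messages without a login, which most hosted providers do not.

```python
import smtplib
from email.message import EmailMessage


def build_alert(word, url, sender, recipient):
    """Build a short notification email saying `word` was found on `url`."""
    msg = EmailMessage()
    msg["Subject"] = "Found '{}' on {}".format(word, url)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The word '{}' appeared on {}.".format(word, url))
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send `msg` through an SMTP server.

    Assumption: the server at host:port accepts mail without authentication.
    For a real provider you would normally need TLS and a login as well.
    """
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Once the word has been found in the parsed page, calling send_alert(build_alert(word, url, "me@example.com", "me@example.com")) would send the notification, with the addresses replaced by your own.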

2 answers

solution1
2 ACCEPTED 2015-12-10 21:22:19

solution2
1 2016-10-11 09:46:45
