Tried Python BeautifulSoup and Phantom JS: STILL can't scrape websites

Question

You may have seen my frustrated posts on here over the past few weeks. I've been scraping some wait-time data and am still unable to grab data from these two sites:

http://www.centura.org/erwait

http://hcavirginia.com/home/

At first I tried BS4 for Python. Sample code below for HCA Virginia:

from BeautifulSoup import BeautifulSoup
import requests

url = 'http://hcavirginia.com/home/'
r = requests.get(url)

soup = BeautifulSoup(r.text)
wait_times = [span.text for span in soup.findAll('span', attrs={'class': 'ehc-er-digits'})]

fd = open('HCA_Virginia.csv', 'a')

for w in wait_times:
    fd.write(w + '\n')

fd.close()

All this does is print blanks to the console or the CSV. So I tried it with PhantomJS, since someone told me the data may be loaded with JS. Yet I get the same result: blanks printed to the console or the CSV. Sample code below.

var page = require('webpage').create(),
    url = 'http://hcavirginia.com/home/';

page.open(url, function(status) {
    if (status !== "success") {
        console.log("Can't access network");
    } else {
        var result = page.evaluate(function() {
            var list = document.querySelectorAll('span.ehc-er-digits'), time = [], i;
            for (i = 0; i < list.length; i++) {
                time.push(list[i].innerText);
            }
            return time;
        });
        console.log(result.join('\n'));
        var fs = require('fs');
        try {
            fs.write("HCA_Virginia.csv", '\n' + result.join('\n'), 'a');
        } catch (e) {
            console.log(e);
        }
    }

    phantom.exit();
});

Same issues with Centura Health :(

What am I doing wrong?

Answer 1

The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.

You can use PyQt4. Adapting this recipe from webscraping.com and combining it with an HTML parser like BeautifulSoup, this is pretty easy:

(After writing this, I found the webscraping library for Python. It might be worth a look.)

import sys
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import * 

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()   

url = 'http://hcavirginia.com/home/'
r = Render(url)
soup = BeautifulSoup(unicode(r.frame.toHtml()))
# In Python 3.x, don't unicode the output from .toHtml(): 
#soup = BeautifulSoup(r.frame.toHtml()) 
# int() needs the text inside each tag, not the Tag object itself
nums = [int(span.text) for span in soup.find_all('span', class_='ehc-er-digits')]
print nums

Output:

[21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
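Since the question appends the wait times to HCA_Virginia.csv, here is a rough sketch of that last step once the numbers have been parsed. `append_wait_times` is a hypothetical helper name, the values are copied from the sample output above, and a temporary directory is used so the example does not touch a real file.

```python
import os
import tempfile


def append_wait_times(path, nums):
    """Append each wait time to `path`, one value per line."""
    with open(path, "a") as fd:
        for n in nums:
            fd.write("%d\n" % n)


# Values copied from the sample output; the temporary directory is only
# there to keep this sketch from writing into the working directory.
nums = [21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
csv_path = os.path.join(tempfile.mkdtemp(), "HCA_Virginia.csv")
append_wait_times(csv_path, nums)

# Read the file back to confirm what was written.
with open(csv_path) as fd:
    lines = fd.read().splitlines()
print(len(lines))
```

Opening the file in a `with` block closes it even if a write fails, which the explicit `fd.close()` in the question does not guarantee.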

This was my original answer, using ghost.py:

I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!

from ghost import Ghost
from time import sleep

ghost = Ghost(wait_timeout=50, download_images=False)
page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                   headers={'User-Agent': 'Mozilla/4.0'})

# Halt execution of the script until a span.ehc-er-digits is found in 
# the document
page, resources = ghost.wait_for_selector("span.ehc-er-digits")

# It should be possible to simply evaluate
# "document.getElementsByClassName('ehc-er-digits');" and extract the data from
# the returned dictionary, but I didn't quite understand the
# data structure - hence this inline javascript.
nums, resources = ghost.evaluate(
    """
    elems = document.getElementsByClassName('ehc-er-digits');
    nums = []
    for (i = 0; i < elems.length; ++i) {
        nums[i] = elems[i].innerHTML;
    }
    nums;
    """)

wt_data = [int(x) for x in nums]
print wt_data
sleep(30) # Sleep a while to avoid the crashing of the script. Weird issue!

Some comments:

As you can see from my comments, I didn't quite figure out the structure of the dict returned by Ghost.evaluate("document.getElementsByClassName('ehc-er-digits');"). It's probably possible to find the information needed using such a query, though.
I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.

solution1
12 ACCPTED 2014-02-26 02:32:58
