Scraping HTML and JavaScript

Question

I am working on a project in which I need to crawl several websites and gather different kinds of information from them. Information like text, links, images, etc.

I am using Python for this. I have tried BeautifulSoup on plain HTML pages and it works, but I am stuck when parsing sites that contain a lot of JavaScript, as most of the information on these pages is stored in <script> tags.

Any ideas how to do this?

Answer 1

First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is not GUI/HMI but an API.

For example, you can use PhantomJS, or Chrome or Firefox, which both support headless mode. Note that PhantomJS is no longer actively maintained.

For a more complete list of headless browsers check here.
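
As a rough illustration of the headless approach, here is a minimal sketch that drives headless Chrome through Selenium. It assumes Selenium 4 or later and a local Chrome install; the names `CHROME_ARGS` and `fetch_rendered_html` are made up for this example and are not part of any library.

```python
# Sketch only: assumes Selenium 4+ and a local Chrome install.
# CHROME_ARGS and fetch_rendered_html are illustrative names.
CHROME_ARGS = ["--headless", "--window-size=1280,1024"]


def fetch_rendered_html(url, load_timeout=30):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in CHROME_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(load_timeout)
        driver.get(url)
        # page_source reflects the DOM as the browser currently sees it
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling `fetch_rendered_html("http://foo")` should give you markup you can hand to any HTML parser. Content that arrives after the initial page load may still be missing, in which case an explicit wait is probably needed before reading `page_source`.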

Answer 2

If a lot of the page content is loaded dynamically by JavaScript, things get more complicated.

Basically, you have 3 ways to crawl the data from the website:

Use the browser developer tools to see which AJAX requests are made on page load, then simulate these requests in your crawler. You will probably need the help of the json and requests modules.
Use tools that drive a real browser, like selenium. In this case you don't care how the page is loaded: you get what a real user sees. Note: you can use a headless browser too.
See if the website provides an API (e.g. the Walmart API).
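
The first option can be sketched roughly as follows. To stay free of extra installs, this example uses the standard library's urllib in place of the requests module. The endpoint, the query parameters, and the JSON schema (an "items" list with "title" and "url" fields) are all hypothetical; the real ones have to be read from the network tab of the browser developer tools.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint, params):
    """Build a GET request that mimics the browser's AJAX call."""
    url = endpoint + "?" + urllib.parse.urlencode(params)
    headers = {
        "Accept": "application/json",
        # Some backends check this header; whether yours does is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


def parse_items(body):
    """Pull (title, url) pairs out of a JSON body with the assumed schema."""
    payload = json.loads(body)
    return [(item["title"], item["url"]) for item in payload.get("items", [])]


def fetch_items(endpoint, params, timeout=10):
    """Send the simulated AJAX request and parse the JSON response."""
    request = build_request(endpoint, params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_items(response.read().decode("utf-8"))
```

For instance, `fetch_items("https://example.com/api/search", {"q": "shoes"})` would return a list of (title, url) tuples, provided the endpoint followed the schema assumed here.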

Also take a look at the Scrapy web-scraping framework. It doesn't handle AJAX calls either, but it is the best tool in the web-scraping world I've ever worked with.

Hope that helps.

Answer 3

To get you started with selenium and BeautifulSoup:

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

Install selenium and BeautifulSoup:

pip install selenium beautifulsoup4

Then fetch the rendered page like this and parse it with BeautifulSoup as usual:

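from bs4 import BeautifulSoup as bs  # installed as beautifulsoup4, imported as bs4
from selenium import webdriver

# webdriver.PhantomJS exists only in Selenium 3.x and earlier; it was removed in Selenium 4
client = webdriver.PhantomJS()
client.get("http://foo")
# name the parser explicitly to avoid a BeautifulSoup warning
soup = bs(client.page_source, "html.parser")
client.quit()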
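
Once the rendered page source is in hand, pulling out the text, links, images, and inline script contents mentioned in the question could look something like the sketch below. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs. The class name `PageExtractor` and the helper `extract` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect visible text, link targets, image sources and inline scripts."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text, self.links, self.images, self.scripts = [], [], [], []
        self._raw_tag = None  # set while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag in ("script", "style"):
            self._raw_tag = tag

    def handle_endtag(self, tag):
        if tag == self._raw_tag:
            self._raw_tag = None

    def handle_data(self, data):
        if not data.strip():
            return  # skip whitespace-only chunks
        if self._raw_tag == "script":
            self.scripts.append(data)  # keep JS source for later inspection
        elif self._raw_tag is None:
            self.text.append(data.strip())  # style contents are dropped


def extract(html, base_url):
    """Parse html and return the populated extractor."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser
```

With the snippet above, `extract(client.page_source, "http://foo")` would expose `.text`, `.links`, `.images` and `.scripts` lists. The script strings can then be searched for embedded data, although how that data is laid out depends entirely on the site.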

Answer 4

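A very fast way would be to iterate through all the tags and collect their textContent. Note that nested text is repeated, since each parent's textContent already includes the text of its descendants. This is the JS snippet:

var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page += tag.textContent;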

or in selenium/python:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
# collect the textContent of every element; execute_script returns the value of the JS return statement
pagetext = driver.execute_script('var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page += tag.textContent; return page;')
driver.quit()

4 answers

solution1
4 ACCEPTED 2014-03-31 14:34:48

solution2
2 2014-03-31 14:34:49

solution3
0 2014-03-31 15:35:35

solution4
0 2018-02-20 01:48:09
