Is there an easy way to parse an HTML document and remove everything except a particular table?

Question

I'm trying to patch together a quick utility that will read records from a website's table and insert them into a database. There are a few conditions:

The page's source is messy. Lots of CSS and JavaScript thrown about. (It's an internal site.)
I know the ID of the table I want.
Once I have the table, I have to parse the rows further to get the specific information I'm looking for.
This has to be done server-side. (Preferably Java, Python, or C++, although if there is another particularly good option, that's fine too.)

So does anyone know of a good library or utility that can "grab" an html element by ID and let me parse it? I know it goes without saying, but I'd prefer one that's as quick as possible.

Answer 1

jQuery can select an element by its ID.

See use-jquery-to-extract-data-from-html-lists-and-tables

The gist of the article is:

var tableObject = $('#myTable tbody tr').map(function() {
  var row = {};

  // Find all of the table cells on this row.
  $(this).find('td').each(function(i) {
    row[i] = $(this).text(); // do something with each td
  });

  return row;
}).get();

Answer 2

You could use a regular expression:

<table[^>]*id="whatever"[^>]*>(.*?)</table>

Then extract the first group (the part of the match in parens) and parse out the rows:

<tr[^>]*>(.*?)</tr>

Finally, with each row, extract the cells:

<td[^>]*>(.*?)</td>

This would work in any of the languages you mentioned.
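As a rough illustration, here is how those three patterns might be chained in Python. This is only a sketch: the sample markup, the `extract_rows` helper, and the constant names are made up for the example, and it assumes the table contains no nested tables and that the `id` attribute is double-quoted exactly as in the pattern above.

```python
import re

# Hypothetical sample page; "whatever" is the table id used in the patterns above.
SAMPLE_HTML = """
<html><body>
<table id="other"><tr><td>skip me</td></tr></table>
<table class="data" id="whatever">
  <tr><td>1</td><td>Alice</td></tr>
  <tr><td>2</td><td>Bob</td></tr>
</table>
</body></html>
"""

# DOTALL lets "." cross line breaks; IGNORECASE tolerates <TABLE>, <TR>, <TD>.
FLAGS = re.DOTALL | re.IGNORECASE
TABLE_RE = re.compile(r'<table[^>]*id="whatever"[^>]*>(.*?)</table>', FLAGS)
ROW_RE = re.compile(r'<tr[^>]*>(.*?)</tr>', FLAGS)
CELL_RE = re.compile(r'<td[^>]*>(.*?)</td>', FLAGS)


def extract_rows(page_source):
    """Return the table's rows as lists of cell strings, or [] if no match."""
    table = TABLE_RE.search(page_source)
    if table is None:
        return []
    return [
        [cell.strip() for cell in CELL_RE.findall(row)]
        for row in ROW_RE.findall(table.group(1))
    ]


rows = extract_rows(SAMPLE_HTML)
```

Keep in mind that the non-greedy `(.*?)</table>` stops at the first closing tag it sees, so a table nested inside the target would cut the match short.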

Answer 3

You could use the lxml library in Python:

#!/usr/bin/env python
from urllib.request import urlopen # Python 3; on Python 2: from urllib2 import urlopen
from lxml import html # $ apt-get install python-lxml or $ pip install lxml

page = urlopen('http://stackoverflow.com/q/11939631')
doc = html.parse(page).getroot()

div = doc.get_element_by_id('question')
for tr in div.find('table').iterchildren('tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content()) # process td

If you are familiar with jQuery, you could use pyquery. It adds a jQuery-like interface on top of lxml:

#!/usr/bin/env python
from pyquery import PyQuery # $ apt-get install python-pyquery or
                            # $ pip install pyquery

# d is like the $ in jquery
d = PyQuery(url='http://stackoverflow.com/q/11939631', parser='html')
for tr in d("#question table > tr"):
    for td in tr.iterchildren('td'):
        print(td.text_content())

Though in this case pyquery doesn't add much. Here's the same using only lxml:

#!/usr/bin/env python
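from urllib.request import urlopen # Python 3; on Python 2: from urllib2 import urlopen
from lxml import html # cssselect() also needs: $ pip install cssselect

page = urlopen('http://stackoverflow.com/q/11939631')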
doc = html.parse(page).getroot()
for tr in doc.cssselect('#question table > tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content()) # process td

Note: the last two examples enumerate rows in all tables (not just the first one) inside the #question element.
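If installing lxml is not an option, the same idea (pick one table by its ID and ignore the rest of the page) can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not the lxml approach above. The `TableByIdParser` class, the `records` ID, and the sample markup are hypothetical, and the sketch assumes every `<tr>`, `<td>`, and `<th>` is explicitly closed.

```python
from html.parser import HTMLParser


class TableByIdParser(HTMLParser):
    """Collect the cell text of the table whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # <table> nesting depth inside the target; 0 = outside
        self.rows = []    # finished rows, each a list of cell strings
        self.row = None   # row currently being built
        self.cell = None  # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track a table nested inside it.
            if self.depth or dict(attrs).get("id") == self.target_id:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.row = []
        elif self.depth == 1 and tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth == 1 and tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif self.depth == 1 and tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        # Text inside nested markup (including nested tables) is folded
        # into the enclosing cell of the target table.
        if self.cell is not None:
            self.cell.append(data)


# Hypothetical page fragment with two tables; only "records" is wanted.
SAMPLE_HTML = """
<div id="question">
  <table id="nav"><tr><td>ignored</td></tr></table>
  <table id="records">
    <tbody>
      <tr><th>id</th><th>name</th></tr>
      <tr><td>1</td><td>Alice <b>Smith</b></td></tr>
    </tbody>
  </table>
</div>
"""

parser = TableByIdParser("records")
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
```

Each entry in `rows` is a plain list of strings, so it can be passed more or less directly to a database insert. Rows are picked up whether or not the markup wraps them in `<tbody>`.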

3 answers

solution1
1 2012-08-13 17:58:29

solution2
1 2012-08-13 18:55:59

solution3
1 ACCEPTED 2012-08-13 19:42:17
