
Is there an easy way to parse an HTML document and remove everything except a particular table?

I'm trying to patch together a quick utility that will read records from a website's table and insert them into a database. There are a few conditions:

  1. The page's source is messy. Lots of CSS and JavaScript thrown about. (It's an internal site.)
  2. I know the ID of the table I want.
  3. Once I have the table, I have to parse the rows further to get the specific information I'm looking for.
  4. This has to be done server-side. (Preferably Java, Python, or C++, although if there is another particularly good option, that's fine too.)

So does anyone know of a good library or utility that can "grab" an HTML element by ID and let me parse it? I know it goes without saying, but I'd prefer one that's as quick as possible.

jQuery can select an element by its ID.

See the article "Use jQuery to extract data from HTML lists and tables".

The gist of the article is:

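var tableObject = $('#myTable tbody tr').map(function() {
  var row = {};
  // Find all of the table cells on this row.
  $(this).find('td').each(function(i) {
    row[i] = $(this).text(); // do something with each td, e.g. keep its text
  });
  return row;
}).get(); // convert the jQuery result to a plain array of row objects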

You could use a regular expression:

<table[^>]*id="whatever"[^>]*>(.*?)</table>

Then extract the first group (the part of the match in parens) and parse out the rows:

<tr[^>]*>(.*?)</tr>

Finally, with each row, extract the cells:

<td[^>]*>(.*?)</td>

This would work in any of the languages you mentioned.
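The three patterns above can be chained as in the following minimal Python sketch. The sample page, the table ID "records", and the function name extract_rows are assumptions made for illustration. Two caveats apply: "." does not match line breaks by default, so the sketch passes re.DOTALL, and the lazy match stops at the first closing tag, so nested tables or an id attribute in single quotes would defeat it.

```python
import re

# Hypothetical sample page; "records" is an assumed table ID.
SAMPLE_HTML = """
<html><body>
<table id="other"><tr><td>skip me</td></tr></table>
<table class="data" id="records">
  <tr><td>1</td><td>Alice</td></tr>
  <tr><td>2</td><td>Bob</td></tr>
</table>
</body></html>
"""

def extract_rows(page, table_id):
    """Return the cell text of each row in the table with the given id."""
    flags = re.DOTALL | re.IGNORECASE  # DOTALL lets "." span line breaks
    table_re = r'<table[^>]*id="%s"[^>]*>(.*?)</table>' % re.escape(table_id)
    match = re.search(table_re, page, flags)
    if match is None:
        return []  # no table with that id on the page
    rows = []
    # Group 1 of the table match is the table body; split it into rows.
    for row_html in re.findall(r'<tr[^>]*>(.*?)</tr>', match.group(1), flags):
        cells = re.findall(r'<td[^>]*>(.*?)</td>', row_html, flags)
        rows.append([cell.strip() for cell in cells])
    return rows

print(extract_rows(SAMPLE_HTML, "records"))
```

On messy real-world markup a proper HTML parser is usually more robust than this approach.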

You could use the lxml library in Python:

#!/usr/bin/env python
from urllib.request import urlopen # Python 3; on Python 2 use: from urllib2 import urlopen
from lxml import html # $ apt-get install python-lxml or $ pip install lxml

page = urlopen('http://stackoverflow.com/q/11939631')
doc = html.parse(page).getroot()

div = doc.get_element_by_id('question')
for tr in div.find('table').iterchildren('tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content()) # process td

If you are familiar with jQuery, you could use pyquery. It adds a jQuery-like interface on top of lxml:

#!/usr/bin/env python
from pyquery import PyQuery # $ apt-get install python-pyquery or
                            # $ pip install pyquery

# d is like the $ in jquery
d = PyQuery(url='http://stackoverflow.com/q/11939631', parser='html')
for tr in d("#question table > tr"):
    for td in tr.iterchildren('td'):
        print(td.text_content())

In this case, though, pyquery doesn't add much. Here's the same thing using only lxml:

#!/usr/bin/env python
from urllib.request import urlopen # Python 3; on Python 2 use: from urllib2 import urlopen
from lxml import html

page = urlopen('http://stackoverflow.com/q/11939631')
doc = html.parse(page).getroot()
for tr in doc.cssselect('#question table > tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content()) # process td

Note: the last two examples enumerate the rows of all tables (not just the first one) inside the #question element.
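Since the question states that the table's ID is known, the lookup can also target that one table directly instead of walking every table in a container. The following is a minimal sketch under assumptions: the inline sample page, the ID "records", and the helper name table_rows are all hypothetical, and the page is parsed from a string rather than fetched.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical sample page; "records" is an assumed table ID.
SAMPLE_HTML = """
<html><body>
<div id="question">
  <table id="layout"><tr><td>not wanted</td></tr></table>
  <table id="records">
    <tr><th>ID</th><th>Name</th></tr>
    <tr><td>1</td><td>Alice</td></tr>
    <tr><td>2</td><td>Bob</td></tr>
  </table>
</div>
</body></html>
"""

def table_rows(page, table_id):
    """Return the text of the <td> cells of each row in one table."""
    doc = html.fromstring(page)
    try:
        table = doc.get_element_by_id(table_id)
    except KeyError:  # lxml raises KeyError when no element has that id
        return []
    rows = []
    # iter('tr') walks all descendants, so rows inside <tbody> are found too
    # (and so would rows of any table nested inside this one).
    for tr in table.iter('tr'):
        cells = [td.text_content().strip() for td in tr.iterchildren('td')]
        if cells:  # skip header rows that contain only <th> cells
            rows.append(cells)
    return rows

print(table_rows(SAMPLE_HTML, "records"))
```

Each inner list holds one record's fields in column order, which is a convenient shape for handing to a database driver's bulk insert call.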
