I'm trying to patch together a quick utility that will read records from a website's table and insert them into a database. There are a few conditions:
So does anyone know of a good library or utility that can "grab" an html element by ID and let me parse it? I know it goes without saying, but I'd prefer one that's as quick as possible.
jQuery can select an element by its ID.
See use-jquery-to-extract-data-from-html-lists-and-tables
The gist of the article is:
var tableObject = $('#myTable tbody tr').map(function(i) {
  var row = {};
  // Find all of the table cells on this row.
  $(this).find('td').each(function(j) {
    row[j] = $(this).text(); // do something with each td
  });
  return row;
}).get();
You could use a regular expression:
<table[^>]*id="whatever"[^>]*>(.*?)</table>
Then extract the first group (the part of the match in parens) and parse out the rows:
<tr[^>]*>(.*?)</tr>
Finally, with each row, extract the cells:
<td[^>]*>(.*?)</td>
This would work in any of the languages you mentioned.
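To make the three steps concrete, here is a minimal Python sketch of the regex approach. The sample markup, the `extract_rows` name, and the `re.DOTALL | re.IGNORECASE` flags are my own assumptions; without a "dot matches newline" flag, `(.*?)` would stop at the first line break, and most real tables span several lines.

```python
import re

# Hypothetical sample page; the id "whatever" mirrors the pattern above.
SAMPLE_HTML = """
<table class="data" id="whatever">
  <tr><td>1</td><td>Alice</td></tr>
  <tr>
    <td>2</td>
    <td>Bob</td>
  </tr>
</table>
"""

# DOTALL lets "." match newlines; IGNORECASE tolerates <TABLE>, <TR>, <TD>.
FLAGS = re.DOTALL | re.IGNORECASE


def extract_rows(page, table_id):
    """Return the table's rows as lists of raw cell strings."""
    # Step 1: grab everything between the opening and closing table tags.
    table = re.search(
        r'<table[^>]*id="%s"[^>]*>(.*?)</table>' % re.escape(table_id),
        page, FLAGS)
    if table is None:
        return []
    rows = []
    # Step 2: split the table body into rows.
    for row_html in re.findall(r'<tr[^>]*>(.*?)</tr>', table.group(1), FLAGS):
        # Step 3: split each row into cells.
        cells = re.findall(r'<td[^>]*>(.*?)</td>', row_html, FLAGS)
        rows.append([cell.strip() for cell in cells])
    return rows


print(extract_rows(SAMPLE_HTML, "whatever"))
```

This only holds up on simple, well-behaved markup: a nested table, single-quoted attributes, or markup inside a cell will trip the patterns, so treat it as a quick hack and not a general HTML parser.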
You could use the lxml library in Python:
#!/usr/bin/env python
import urllib2  # Python 2 only; on Python 3 use urllib.request.urlopen instead
from lxml import html  # $ apt-get install python-lxml or $ pip install lxml

page = urllib2.urlopen('http://stackoverflow.com/q/11939631')
doc = html.parse(page).getroot()
div = doc.get_element_by_id('question')
for tr in div.find('table').iterchildren('tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content())  # process td
If you are familiar with jQuery, you could use pyquery. It adds a jQuery-like interface on top of lxml:
#!/usr/bin/env python
from pyquery import PyQuery  # $ apt-get install python-pyquery or $ pip install pyquery

# d is like the $ in jquery
d = PyQuery(url='http://stackoverflow.com/q/11939631', parser='html')
for tr in d("#question table > tr"):
    for td in tr.iterchildren('td'):
        print(td.text_content())
Though in this case pyquery doesn't add enough. Here's the same using only lxml:
#!/usr/bin/env python
import urllib2
from lxml import html

page = urllib2.urlopen('http://stackoverflow.com/q/11939631')
doc = html.parse(page).getroot()
for tr in doc.cssselect('#question table > tr'):
    for td in tr.iterchildren('td'):
        print(td.text_content())  # process td
Note: the last two examples enumerate rows in all tables (not just the first one) inside the #question element.
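To see that difference without hitting the network, here is a small sketch on an inline document. The sample HTML and the `cells` helper are made up for illustration, and I've used a relative XPath (`.//table/tr`) as a stand-in for the `#question table > tr` selector so the snippet needs nothing beyond lxml itself.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical page with two tables inside the #question element.
SAMPLE_HTML = """
<html><body>
<div id="question">
  <table><tr><td>a</td><td>b</td></tr></table>
  <table><tr><td>c</td></tr></table>
</div>
</body></html>
"""


def cells(tr):
    """Text of each <td> that is a direct child of the given row."""
    return [td.text_content() for td in tr.iterchildren('td')]


doc = html.fromstring(SAMPLE_HTML)
div = doc.get_element_by_id('question')

# Rows of every table under #question (what the selector-based examples do).
all_rows = [cells(tr) for tr in div.xpath('.//table/tr')]

# Rows of only the first table that is a direct child of #question.
first_rows = [cells(tr) for tr in div.find('table').iterchildren('tr')]

print(all_rows)
print(first_rows)
```

If you only want one table, the `find('table')` form is the simplest way to say so; with a selector you would have to narrow it yourself.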