
Fastest, easiest, and best way to parse an HTML table?

I'm trying to get this table http://www.datamystic.com/timezone/time_zones.html into array format so I can do whatever I want with it. Preferably in PHP, python or JavaScript.

This is the kind of problem that comes up a lot, so rather than looking for help with this specific problem, I'm looking for ideas on how to solve all similar problems.

BeautifulSoup is the first thing that comes to mind. Another possibility is copying/pasting it into TextMate and then running regular expressions.

What do you suggest?

This is the script that I ended up writing, but as I said, I'm looking for a more general solution.

from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.datamystic.com/timezone/time_zones.html'
response = urllib2.urlopen(url)
html = response.read()

soup = BeautifulSoup(html)
table = soup.findAll("table")[1]  # the second table holds the timezone data
for row in table.findAll("tr"):
    tds = row.findAll('td')
    if len(tds) == 4:  # data rows have exactly four cells
        countrycode = tds[1].string
        timezone = tds[2].string
        if countrycode is not None and timezone is not None:
            print "'%s' => '%s'," % (countrycode.strip(), timezone.strip())

Comments and suggestions for improvement to my Python code are welcome, too ;)
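For what it's worth, the same script under Python 3 / BeautifulSoup 4 would look roughly like this (a sketch; it assumes bs4 is installed, and I haven't rerun it against the live page):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.datamystic.com/timezone/time_zones.html'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

table = soup.find_all('table')[1]  # the second table holds the timezone data
for row in table.find_all('tr'):
    tds = row.find_all('td')
    if len(tds) == 4:  # data rows have exactly four cells
        countrycode, timezone = tds[1].string, tds[2].string
        if countrycode is not None and timezone is not None:
            print("'%s' => '%s'," % (countrycode.strip(), timezone.strip()))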

For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).

A quick example for your specific case:

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
table = tree.findall('//table')[1]  # the second table on the page
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]

This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
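If it helps: since (as in your own script) the real data rows appear to have exactly four cells, one hedged way to drop the ad rows afterwards is a one-line filter:

data = [row for row in data if len(row) == 4]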

BUT, more specifically for your particular use case: there are better ways to get at timezone database information than scraping that particular webpage (as an aside: note that the page actually states that you are not allowed to copy its contents). There are even existing libraries that already use this information; see for example python-dateutil.
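For instance, a minimal sketch using python-dateutil's tz module to resolve a zone straight from the tz database (the zone name here is an arbitrary example):

from datetime import datetime
from dateutil import tz

zone = tz.gettz('Europe/Amsterdam')  # look the zone up directly, no scraping
print(datetime.now(tz=zone))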

Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure...

There are a few other alternatives as well; all of these are reasonably tolerant of poorly formed HTML.

I suggest loading the document with PHP's bundled DOM parser via DOMDocument::loadHTMLFile and then using XPath to grep the data you need.

This is not the fastest way, but in the end the most readable one (in my opinion). You could use a regex, which would probably be a little faster, but that would be bad style (hard to debug, hard to read).
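Since the rest of this thread uses Python, here is the same load-then-XPath idea sketched with lxml instead of PHP's DOMDocument (the exact XPath expression is an assumption about the page layout):

from lxml import html

tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
# Full XPath, predicates included: keep only rows with exactly four cells.
for row in tree.xpath('(//table)[2]//tr[count(td)=4]'):
    print([td.text_content().strip() for td in row.xpath('td')])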

EDIT: Actually this is hard, because the page you mentioned is not valid HTML (see validator.w3.org); in particular, tags with missing opening/closing counterparts get badly in the way.

It looks, though, like xmlstarlet (http://xmlstar.sourceforge.net/, a great tool) is able to repair the problem (run xmlstarlet fo -R). xmlstarlet can also handle XPath and XSLT scripts, which can help you extract your data with a simple shell script.
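For example, a rough sketch of driving that repair step from Python (it assumes xmlstarlet is on the PATH and that the page was saved locally as time_zones.html):

import subprocess
from lxml import etree

# 'xmlstarlet fo -R' reformats the document in recover mode, fixing markup.
fixed = subprocess.run(['xmlstarlet', 'fo', '-R', 'time_zones.html'],
                       capture_output=True, text=True).stdout
tree = etree.fromstring(fixed.encode())  # now parseable as XML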

While we were building SerpAPI we tested many platforms and parsers.

Here is the benchmark result for Python.

[Figure: Python parser benchmark results]

For more, here is a full article on Medium: https://medium.com/@vikoky/fastest-html-parser-available-now-f677a68b81dd
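As a sketch of how such a comparison can be reproduced locally (the file name and the parser line-up here are assumptions; measure on your own pages):

import timeit

cases = {
    'lxml':         ("from lxml import html; doc = open('time_zones.html').read()",
                     "html.fromstring(doc)"),
    'bs4 + lxml':   ("from bs4 import BeautifulSoup; doc = open('time_zones.html').read()",
                     "BeautifulSoup(doc, 'lxml')"),
    'bs4 + stdlib': ("from bs4 import BeautifulSoup; doc = open('time_zones.html').read()",
                     "BeautifulSoup(doc, 'html.parser')"),
}
for name, (setup, stmt) in cases.items():
    best = min(timeit.repeat(stmt, setup=setup, number=20, repeat=3))
    print('%-12s %.3fs' % (name, best))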

The efficiency of a regex is superior to that of a DOM parser.

Look at this comparison:

http://www.rockto.com/launcher/28852/mochien.com/Blog/Read/A300111001736/Regex-VS-DOM-untuk-Rockto-Team

You can find many more comparisons by searching the web.
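For illustration, the regex approach at its simplest (fast, but fragile on nested tags and irregular markup, which is why other answers here advise against it):

import re

html_doc = open('time_zones.html').read()  # assumes a locally saved copy
# Crude: captures whatever sits between <td ...> and </td>, markup included.
cells = re.findall(r'<td[^>]*>(.*?)</td>', html_doc, re.IGNORECASE | re.DOTALL)
print(cells[:8])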
