HTML Parsing vs. Regex

Original 2014-10-11 20:08:14 9 1 python/ html/ regex/ scala/ html-parsing

I have a fixed, well-structured HTML source. The incoming data is clean and small; it contains just a short list of divs. I know the usual advice is to use an HTML parser for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The conditions of the problem are below:

Data is clean and well structured
Data is small
Performance matters; the application must be able to fetch as much data as possible
The application will write data to a MongoDB database
The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

1 answer

I would still stick with an HTML parser: at the very least, HTML is a specific data format, and a parser is a specialized tool that understands that format.

If performance matters here, there is the lxml package, which is very fast because it is built on the C libraries libxml2 and libxslt. For HTML, use lxml.html.
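As a rough illustration of the lxml.html suggestion, here is a minimal sketch. The sample markup, the `item` class name, and the `extract_items` helper are all assumptions made up for this example, since the question does not show its actual HTML.

```python
import lxml.html

# Hypothetical sample input: a small, fixed list of divs, as the question describes.
SAMPLE_HTML = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="other">ignored</div>
</body></html>
"""


def extract_items(html_text):
    """Return the stripped text of every <div class="item"> in the document."""
    root = lxml.html.fromstring(html_text)
    # XPath selects by structure, so whitespace or attribute order in the
    # source does not affect the match.
    return [div.text_content().strip() for div in root.xpath('//div[@class="item"]')]


items = extract_items(SAMPLE_HTML)
print(items)
```

The returned list of strings could then be turned into documents for MongoDB; that step is left out here because the question gives no schema.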

You can also use the awesome BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only that part; see "Parsing only part of a document" in the BeautifulSoup documentation.
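A small sketch of both ideas together, BeautifulSoup on top of lxml plus partial parsing with `SoupStrainer`. As before, the sample markup and the `item` class are hypothetical stand-ins for the real input.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample input; only the <div class="item"> elements are wanted.
SAMPLE_HTML = """
<html><head><title>page</title></head><body>
  <p>navigation and other markup we do not need</p>
  <div class="item">first</div>
  <div class="item">second</div>
</body></html>
"""


def extract_items(html_text):
    """Parse only the matching divs and return their text."""
    # The strainer tells BeautifulSoup which elements to keep while parsing,
    # so the rest of the document is never turned into objects.
    only_items = SoupStrainer("div", attrs={"class": "item"})
    soup = BeautifulSoup(html_text, "lxml", parse_only=only_items)
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="item")]


items = extract_items(SAMPLE_HTML)
print(items)
```

Note that `parse_only` works with the lxml and html.parser backends but not with html5lib, so the choice of `"lxml"` here matters.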

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:

RegEx match open tags except XHTML self-contained tags
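To make the argument from that thread concrete, here is a small sketch of one way a regex goes wrong as soon as the markup nests. It uses the standard library's `html.parser` only to keep the example dependency-free (lxml or BeautifulSoup would handle this just as well); the sample string and the `ItemText` class are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical input with a div nested inside the target div.
NESTED = '<div class="item">outer <div>inner</div> tail</div>'

# A naive non-greedy regex stops at the first closing tag it sees,
# so it cuts the content off in the middle of the nested div.
naive = re.findall(r'<div class="item">(.*?)</div>', NESTED)


class ItemText(HTMLParser):
    """Collect all text inside <div class="item">, tracking div nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif ("class", "item") in attrs:
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = ItemText()
parser.feed(NESTED)
parser.close()
parsed_text = "".join(parser.chunks)

print(naive)        # ['outer <div>inner']
print(parsed_text)  # outer inner tail
```

If the input is guaranteed never to nest or change shape, the regex may happen to work, but the parser version keeps working when that guarantee quietly breaks.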

Related questions:

HTML Parsing with Python (HTML vs. complete website)

Node.js vs. Python for parsing HTML

Efficient regex parsing of html

Regex/Beautifulsoup HTML parsing

Python Regex - Parsing HTML

Argument parsing in Python (required vs. optional)

Templates vs. coded HTML

regex tokenizer period vs. ellipsis

Finding links fast: regex vs. lxml

Using decode() vs. regex to unescape this string

The technical posts on this site are published under the CC BY-SA 4.0 license. If you reprint one, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 4 ACCEPTED 2014-10-11 20:15:37