
Html Parsing vs. Regex

I have a fixed, well-structured HTML source. The incoming data is clean and small; it just contains a short list of divs. I know that an HTML parser is the usual tool for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The problem conditions are below:

  • Data is clean and well structured
  • Data is small
  • Performance matters; the application must be able to fetch as much data as possible
  • Application will write data to MongoDB database
  • Implementation programming language will be Scala or Python

Any opinion is valuable. What should I do?

I would still stick with an HTML parser because, at the very least, HTML is a specific data format and a parser is a specialized tool that understands that format.

If performance matters here, there is the blazingly fast lxml package. For HTML, use lxml.html.
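A minimal sketch of what this could look like for a small list of divs. The sample markup, the "item" class, the data-id attribute, and the extract_items helper are all hypothetical, since the question does not show the real HTML:

```python
import lxml.html

# Hypothetical sample input; the real markup is not shown in the question.
SAMPLE_HTML = """
<html><body>
  <div class="item" data-id="1">first</div>
  <div class="item" data-id="2">second</div>
  <div class="other">skip me</div>
</body></html>
"""


def extract_items(html_text):
    """Return one dict per <div class="item"> found in the document."""
    tree = lxml.html.fromstring(html_text)
    return [
        {"id": div.get("data-id"), "text": div.text_content().strip()}
        for div in tree.xpath('//div[@class="item"]')
    ]


items = extract_items(SAMPLE_HTML)
```

The result is a plain list of dicts, which is a convenient shape to hand to a MongoDB driver afterwards.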

You can also use the excellent BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only that part; see the "Parsing only part of a document" section of the BeautifulSoup documentation.
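As a rough illustration of partial parsing, the sketch below passes a SoupStrainer through the parse_only argument so that only the matching divs are turned into objects. The sample markup and the "item" class are assumptions made for the example:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample input; the real markup is not shown in the question.
SAMPLE_HTML = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="other">skip me</div>
</body></html>
"""

# Keep only <div class="item"> elements while the document is being parsed.
only_items = SoupStrainer("div", attrs={"class": "item"})

# "lxml" selects the lxml parser as the backend (the lxml package must be installed).
soup = BeautifulSoup(SAMPLE_HTML, "lxml", parse_only=only_items)

texts = [div.get_text(strip=True) for div in soup.find_all("div")]
```

Whether this is noticeably faster than parsing the whole document depends on how much of the page is skipped, so it is worth measuring on real input.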

And, to follow the tradition for HTML+regex threads, here is the customary reference to the famous Stack Overflow topic covering the reasons why you should not use regex for parsing HTML: "RegEx match open tags except XHTML self-contained tags".

