HTML Parsing vs. Regex

I have a fixed, well-structured HTML source. The incoming data is clean and small: it contains just a short list of divs. I know that an HTML parser is the usual tool for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The conditions are as follows:

  • The data is clean and well structured
  • The data is small
  • Performance matters: the application must be able to fetch as much data as possible
  • The application will write the data to a MongoDB database
  • The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

I would still stick to using an HTML parser because, at the very least, HTML is a specific data format and a parser is a specialized tool that understands that format.

If performance matters here, there is the very fast lxml package. For HTML, use lxml.html.
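As a rough illustration of the lxml.html suggestion, here is a minimal sketch. The sample markup, the "item" class name, and the extract_items helper are assumptions made up for this example, since the question does not show the real page; adjust the XPath to the actual structure.

```python
from lxml import html

# Hypothetical input: the question only says "a little list of divs",
# so this markup and the "item" class are assumptions.
SAMPLE = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="other">ignored</div>
</body></html>
"""


def extract_items(source):
    """Return the stripped text of every <div class="item"> in the source."""
    tree = html.fromstring(source)
    # XPath selects only divs whose class attribute is exactly "item".
    return [div.text_content().strip() for div in tree.xpath('//div[@class="item"]')]


items = extract_items(SAMPLE)
print(items)
```

The resulting list of strings can then be turned into documents for MongoDB in whatever shape the application needs.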

You can also use the BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only the relevant part of the document; see "Parsing only part of a document" in the BeautifulSoup documentation.
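A minimal sketch of that idea, assuming the same kind of made-up markup (the "item" class is hypothetical): SoupStrainer tells BeautifulSoup which tags to keep while parsing, and the "lxml" argument selects the lxml parser. Whether this actually speeds things up on such small inputs is something to measure rather than assume.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical input; the real page structure is not shown in the question.
SAMPLE = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="other">ignored</div>
</body></html>
"""

# Keep only <div class="item"> elements; everything else is dropped at parse time.
only_items = SoupStrainer("div", attrs={"class": "item"})

# "lxml" makes BeautifulSoup delegate the parsing to the lxml parser.
soup = BeautifulSoup(SAMPLE, "lxml", parse_only=only_items)

texts = [div.get_text(strip=True) for div in soup.find_all("div")]
print(texts)
```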

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:
