HTML Parsing vs. Regex
I have a fixed, well-structured HTML source. The incoming data is clean and small: it contains just a short list of divs. I know that an HTML parser is the usual tool for parsing HTML, but this looks like a special case, and I am not sure which approach I should use. The problem conditions are below.
Any opinion is valuable, so what should I do?
I would still stick with an HTML parser because, at the very least, HTML is a specific data format and a parser is a specialized tool that understands that format.
If performance matters here, there is the blazingly fast lxml package. For HTML, use lxml.html.
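As a rough illustration of the lxml.html route, here is a minimal sketch. The sample markup, the "item" class name, and the extract_items helper are assumptions made up for this example, since the actual HTML is not shown in the question.

```python
from lxml import html

# Hypothetical input: a small, fixed list of divs like the one described
# in the question. The real markup is not shown there.
SAMPLE = """
<div id="items">
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="item">third</div>
</div>
"""


def extract_items(source):
    """Return the stripped text of every div whose class is exactly "item"."""
    root = html.fromstring(source)
    # XPath selects the divs by attribute, so it keeps working if whitespace
    # or attribute quoting in the source changes.
    return [div.text_content().strip() for div in root.xpath('//div[@class="item"]')]


if __name__ == "__main__":
    print(extract_items(SAMPLE))
```

The XPath expression would need adjusting to match the real structure of the divs.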
You can also use the excellent BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need to parse is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only the relevant part of the document; see more at: Parsing only part of a document.
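A minimal sketch of both ideas together, BeautifulSoup on top of the lxml parser plus partial parsing with SoupStrainer, could look like the following. The sample document, the "items" id, and the "item" class are hypothetical and only stand in for the real source.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document: only the div with id="items" is of interest.
SAMPLE = """
<html><body>
  <div id="header">navigation we do not need</div>
  <div id="items">
    <div class="item">first</div>
    <div class="item">second</div>
  </div>
  <div id="footer">footer we do not need</div>
</body></html>
"""


def extract_items(source):
    """Parse only the div with id="items" and return the text of its item divs."""
    # SoupStrainer tells BeautifulSoup which part of the document to build a
    # tree for; everything outside the matching element is skipped.
    only_items = SoupStrainer("div", id="items")
    soup = BeautifulSoup(source, "lxml", parse_only=only_items)
    return [div.get_text(strip=True) for div in soup.find_all("div", class_="item")]


if __name__ == "__main__":
    print(extract_items(SAMPLE))
```

Note that parse_only is not supported by the html5lib parser, so this sketch assumes lxml (or the built-in html.parser) is the one in use.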
And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML: