
HTML Parsing vs. Regex

Original 2014-10-11 20:08:14 9 1 python/ html/ regex/ scala/ html-parsing

I have a fixed, well-structured HTML source. The incoming data is clean and small: it contains just a short list of divs. I know that an HTML parser is the usual tool for parsing HTML, but this looks like a special case, and I am not sure whether I should use a parser or a regex. The conditions are below:

Data is clean and well structured
Data is small
Performance matters: the application must be able to process as much data as possible
The application will write the data to a MongoDB database
The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

1 Answer

I would still stick to using an HTML parser because, at the very least, HTML is a specific data format and a parser is a specialized tool that understands it.

If performance matters here, there is the blazingly fast lxml package. For HTML, use lxml.html.
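
As a rough illustration of the lxml.html suggestion, here is a minimal sketch. The sample markup, the "item" class, the data-id attribute, and the extract_items helper are all made up for the example, since the question does not show its actual HTML.

```python
from lxml import html

# Hypothetical input: a small, well-structured list of divs.
SAMPLE = """
<div id="items">
  <div class="item" data-id="1">First</div>
  <div class="item" data-id="2">Second</div>
</div>
"""


def extract_items(source):
    """Return one dict per <div class="item"> found in the markup."""
    root = html.fromstring(source)
    return [
        {"id": div.get("data-id"), "text": div.text_content().strip()}
        for div in root.xpath('//div[@class="item"]')
    ]


print(extract_items(SAMPLE))
```

Each result is a plain dict, which is the shape a MongoDB driver expects for a document, so the list could be handed on for insertion without further conversion.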

You can also use the awesome BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only the relevant part of the document; see "Parsing only part of a document" in the BeautifulSoup documentation.
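
A sketch of what that might look like, using BeautifulSoup's SoupStrainer with the lxml parser. As before, the sample page, the "item" class, and the extract_items helper are assumptions made for the example, not taken from the question.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: only the <div class="item"> elements are of interest.
SAMPLE = """
<html><body>
  <div id="header">Navigation we do not care about</div>
  <div id="items">
    <div class="item" data-id="1">First</div>
    <div class="item" data-id="2">Second</div>
  </div>
</body></html>
"""

# Build a tree only for tags matching this filter; everything else is skipped.
only_items = SoupStrainer("div", attrs={"class": "item"})


def extract_items(source):
    """Parse just the matching divs and return them as plain dicts."""
    soup = BeautifulSoup(source, "lxml", parse_only=only_items)
    return [
        {"id": div.get("data-id"), "text": div.get_text(strip=True)}
        for div in soup.find_all("div", attrs={"class": "item"})
    ]


print(extract_items(SAMPLE))
```

Note that parse_only works with the lxml and html.parser backends but not with html5lib, so the choice of "lxml" here matters for more than speed.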

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML: