
How to parse bad HTML?

Original post 2012-05-23 13:28:52 9 3 c#/ html/ regex

I am writing a search engine that crawls all of my company's affiliate websites, parses their HTML, and stores the content in a database. These websites are really old and not HTML compliant: out of 100,000 websites, around 25% have bad HTML that is difficult to parse. I need to write C# code that can fix the bad HTML and then parse the contents, or come up with another solution to this problem. If you have an idea, an actual hint or code snippet would help.

3 answers

Just use Html Agility Pack. It is very good at parsing faulty HTML.

People generally use some form of heuristic-driven tag soup parser.

Such parsers exist for Java and Haskell, for example.

These are mostly just lexers that try their best to build an AST from all the random symbols.
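
To make the "lexer plus heuristics" idea concrete, here is a minimal sketch. It is written in Python rather than C# only because Python's standard-library `html.parser` is a lenient tokenizer, so the example needs no third-party package. The names `Node`, `SoupTreeBuilder`, `parse` and `iter_text` are made up for illustration, and the three heuristics shown (void tags never nest, a new `<p>` or `<li>` closes an open one, stray end tags are ignored) are assumptions about what a tag soup parser might do, not the rules of any particular library.

```python
from html.parser import HTMLParser

# Assumed heuristics for this sketch only; real tag soup parsers have many more.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
NO_SELF_NESTING = {"p", "li"}


class Node:
    """One element of the recovered tree; children are Node objects or strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class SoupTreeBuilder(HTMLParser):
    """Turns the token stream from HTMLParser into a best-effort tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Heuristic: "<p>one<p>two" means two sibling paragraphs, not nesting.
        if tag in NO_SELF_NESTING and self.current.tag == tag:
            self.current = self.current.parent
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        # Heuristic: void elements never contain children.
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            # Closing it implicitly closes everything opened inside it.
            self.current = node.parent
        # Otherwise this is a stray end tag and is ignored.

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def parse(html):
    """Parse possibly malformed HTML and return the root Node."""
    builder = SoupTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_text(node):
    """Yield the text fragments of the tree in document order."""
    for child in node.children:
        if isinstance(child, str):
            yield child.strip()
        else:
            yield from iter_text(child)


if __name__ == "__main__":
    broken = "<div><p>one<p>two <b>bold</div><span>after</span></i>"
    print(" ".join(iter_text(parse(broken))))
```

Even with unclosed `<p>` and `<b>` tags and a stray `</i>`, the builder never fails; it simply makes a guess at each malformed point. For indexing text into a database, a guess that keeps the text in document order is usually good enough.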

Use a tag soup parser; I'm sure there is one for C#. Then you can serialize the DOM to more-or-less valid HTML, depending on whether that parser conforms to the HTML DTD. Alternatively, you can use HTML Tidy, which will clean up at least the worst faults.
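
As a rough illustration of the "parse leniently, then write out cleaner markup" step, the sketch below re-emits tags as it reads them and closes whatever was left open. It is not HTML Tidy and it does not build a DOM; it streams tokens from Python's standard-library `html.parser` (chosen so the example is self-contained, even though the question targets C#). `HtmlRepairer` and `repair` are hypothetical names, and the sketch deliberately drops comments and doctypes and does not treat `<script>` or `<style>` content specially.

```python
from html import escape
from html.parser import HTMLParser

# Assumed heuristics for this sketch only.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
NO_SELF_NESTING = {"p", "li"}


class HtmlRepairer(HTMLParser):
    """Re-emits markup with every opened element explicitly closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        # Heuristic: a new <li> or <p> closes the one that is still open.
        if tag in NO_SELF_NESTING and self.open_tags and self.open_tags[-1] == tag:
            self.out.append(f"</{self.open_tags.pop()}>")
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # valueless attribute, e.g. "disabled"
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def repair(html):
    """Return a best-effort well-nested version of the input markup."""
    repairer = HtmlRepairer()
    repairer.feed(html)
    repairer.close()
    # Close whatever is still open at end of input.
    while repairer.open_tags:
        repairer.out.append(f"</{repairer.open_tags.pop()}>")
    return "".join(repairer.out)


if __name__ == "__main__":
    print(repair("<ul><li>a<li>b</ul><p>x <b>y"))
```

The output is well nested but not guaranteed to validate against any DTD, which matches the "more-or-less valid" caveat above. A strict parser can then be run on the repaired markup.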