How to parse this piece of HTML?

Question

Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

group 1: content of the h1
group 2: content of the text following the h1
group 3-n: content of the subcaptions + text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

This gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr/> in the pattern consumes the separator that the next match would need. For parsing the h1 caption I have another pattern ( <h1.*?>(.*?)</h1> ), which only gives me the caption but not the content. I'm fine with that for now.

Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning the parts that way)?

Edit:
As some brought in HtmlAgilityPack, I was curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest. The tags for the content may vary, from <p> to <div> and <ul>. At the moment it looks like I more or less have to iterate over the whole document and parse it tag by tag. Any hints?

Answer 1

You really need an HTML parser.

Answer 2

Don't use regex to parse HTML. Consider using the HTML Agility Pack.
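A minimal sketch of how that could look for the snippet in the question, in C#. It assumes the HtmlAgilityPack assembly is referenced and that captions and their content are siblings, as in the question's HTML. SectionParser and Parse are hypothetical names made up for this illustration, not part of the library. Because it collects every element between two headings, it does not matter whether the content is a <p>, a <div> or a <ul>.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party library, assumed to be referenced

// Hypothetical helper, not part of HtmlAgilityPack.
static class SectionParser
{
    // Returns one (caption, content markup) pair per <h1>/<h2> heading.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h1 | //h2");
        if (headings == null)
            return sections;

        foreach (HtmlNode heading in headings)
        {
            var content = new StringBuilder();

            // Walk the following siblings until the next heading starts.
            for (HtmlNode n = heading.NextSibling; n != null; n = n.NextSibling)
            {
                if (n.Name == "h1" || n.Name == "h2")
                    break;                      // next section begins here
                if (n.Name == "hr")
                    continue;                   // separator, not content
                if (n.NodeType == HtmlNodeType.Element)
                    content.Append(n.OuterHtml); // keep <p>, <div>, <ul>, ...
            }

            sections.Add(new KeyValuePair<string, string>(
                heading.InnerText.Trim(), content.ToString()));
        }

        return sections;
    }
}
```

Calling SectionParser.Parse(html) on the question's HTML gives a list in which each entry pairs a caption with the markup that follows it. If the real pages nest the headings inside other elements, the sibling walk would need to be adapted.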

Answer 3

There are some possibilities:

Regex - Fast but not reliable; it can't deal with malformed HTML.

HtmlAgilityPack - Good, but it has many memory leaks. If you only want to process a few files, that is no problem.

SgmlReader - Really good, but there is one problem. Sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good, but not as fast as SgmlReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SgmlReader (VB.NET):

' Requires: Imports System.Xml.Linq
' vSource is assumed to hold the HTML source as a String.
Dim sgmlReader As New Sgml.SgmlReader()
Dim htmldoc As XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = XDocument.Load(sgmlReader)
Dim XNS As XNamespace

' This part can fail: sometimes the default namespace cannot be retrieved.
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the text of all <script> elements.
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 works differently: you have to walk to every tag with a "Next" command. You can find example code with the DLL.

Answer 4

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a library built on top of the HtmlAgilityPack called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div> :

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It can't get any easier than that!


4 answers

Answer 1: 9, accepted, 2010-01-19 06:51:31
Answer 2: 6, 2010-01-19 06:51:44
Answer 3: 2, 2011-12-19 13:29:59
Answer 4: 1, 2012-01-19 16:58:15
