How to parse this piece of HTML?

Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

  • group 1: content of the h1
  • group 2: the text following the h1
  • group 3-n: content of the subcaptions + their text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

Because of the trailing <hr/> , this only gives me every odd subcaption + content (e.g. 1, 3, ...). For parsing the h1 caption I have another pattern ( <h1.*?>(.*?)</h1> ), which only gives me the caption but not the content. I'm fine with that for now.
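
The odd-only result comes from the trailing <hr.*?/> being consumed by each match, so the next match can only start at the separator after that. A minimal sketch of one way around this, assuming the flat layout shown above: replace the trailing separator with a lookahead so it is checked but not consumed. The sample input, captions, ids and class name below are made up for illustration, and the pattern is still fragile against malformed HTML.

```csharp
using System;
using System.Text.RegularExpressions;

class SubcaptionRegexDemo
{
    static void Main()
    {
        // Hypothetical sample input modelled on the snippet in the question.
        string html =
            "<h1>My caption</h1>\n<p>Here will be some text</p>\n\n" +
            "<hr class=\"cs\" />\n<h2 id=\"a\">Caption A</h2>\n<p>Text A</p>\n\n" +
            "<hr class=\"cs\" />\n<h2 id=\"b\">Caption B</h2>\n<p>Text B</p>\n\n" +
            "<hr class=\"cs\" />\n<h2 id=\"c\">Caption C</h2>\n<p>Text C</p>\n";

        // The lookahead (?=...) requires the next <hr .../> or the end of input
        // to follow, but leaves it unconsumed so the next match can start there.
        string pattern = @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>\s*(.*?)(?=<hr[^>]*/>|\z)";

        // Singleline lets '.' match newlines, so the content may span lines.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            string caption = m.Groups[1].Value;
            string content = m.Groups[2].Value.Trim();
            Console.WriteLine("{0}: {1}", caption, content);
        }
    }
}
```

With this change every section should be reported, not only every second one, and the last section no longer needs a closing <hr/>.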

Does anybody have a hint or solution for me, or an alternative approach (e.g. parsing the HTML via a reader and assigning the parts that way)?

Edit:
As some people brought up the HTMLAgilityPack , I was curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary, from <p> to <div> and <ul> . At the moment it seems I more or less have to iterate over the whole document and parse it tag by tag...? Any hints?

You really need an HTML parser.

Don't use regex to parse HTML. Consider using the HTML Agility Pack .
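
Since the content tags in the question may vary ( <p> , <div> , <ul> ...), one possible approach with the HTML Agility Pack is to walk the siblings that follow each heading until the next separator, instead of matching specific tags. The following is only a sketch: it assumes the headings and their content are siblings in a flat structure as in the question, and the class name, helper name and sample input are made up for illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // third-party library, assumed to be referenced

class SectionWalkDemo
{
    // Hypothetical helper: concatenates the markup of every sibling after
    // 'start' until the next <hr> or <h2> (or the end of the sibling list).
    static string ContentAfter(HtmlNode start)
    {
        var sb = new StringBuilder();
        for (HtmlNode n = start.NextSibling; n != null; n = n.NextSibling)
        {
            if (n.Name == "hr" || n.Name == "h2") break;
            sb.Append(n.OuterHtml);
        }
        return sb.ToString().Trim();
    }

    static void Main()
    {
        // Hypothetical sample input with varying content tags.
        string html =
            "<h1>My caption</h1><p>Here will be some text</p>" +
            "<hr class=\"cs\" /><h2 id=\"a\">Caption A</h2><p>Some text</p>" +
            "<hr class=\"cs\" /><h2 id=\"b\">Caption B</h2><div>Other text</div><ul><li>Item</li></ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Groups 1 and 2: the h1 caption and whatever follows it.
        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1");
        if (h1 != null)
        {
            Console.WriteLine("h1: " + h1.InnerText);
            Console.WriteLine("text: " + ContentAfter(h1));
        }

        // Groups 3-n: each h2 caption with its content.
        // SelectNodes returns null when nothing matches, hence the check.
        HtmlNodeCollection subCaptions = doc.DocumentNode.SelectNodes("//h2");
        if (subCaptions != null)
        {
            foreach (HtmlNode h2 in subCaptions)
            {
                Console.WriteLine("h2: " + h2.InnerText);
                Console.WriteLine("text: " + ContentAfter(h2));
            }
        }
    }
}
```

Stopping on the separator rather than listing allowed content tags means new tag types in the content need no code change. If the real pages nest the sections in wrapper elements, the sibling walk would have to be adapted.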

There are several possibilities:

Regex - Fast but not reliable; it can't deal with malformed HTML.

HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, this is not a problem.

SgmlReader - Really good, but there is one problem. Sometimes it can't find the default namespace needed to get the other nodes, and then it is impossible to parse the HTML.

http://developer.mindtouch.com/SgmlReader

Majestic-12 - Good, but not as fast as SgmlReader.

http://www.majestic12.co.uk/projects/html_parser.php

Example for SgmlReader (VB.NET):

' vSource is assumed to hold the HTML source as a String
Dim sgmlReader As New Sgml.SgmlReader()
Dim htmldoc As System.Xml.Linq.XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = System.Xml.Linq.XDocument.Load(sgmlReader)
Dim XNS As System.Xml.Linq.XNamespace

' This part can fail: sometimes the default namespace cannot be determined
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the contents of all <script> elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 works differently: you have to step through every tag with a "Next" command. Example code is included with the DLL.

As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a fork of the HtmlAgilityPack called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div> :

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It can't get any easier than that!
