How to parse this piece of HTML?
Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML with a regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me every odd subcaption plus its content (e.g. 1, 3, ...) because of the trailing <hr/>. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content. I'm fine with that for now.
Does anybody have a hint or a solution for me, or an alternative approach (e.g. parsing the HTML with a reader and assigning the parts that way)?
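A note on why the pattern above skips every other section: each match consumes the trailing <hr.*?/>, so the next match cannot begin at that same <hr> and has to start at the one after it. One possible workaround, if regex is kept at all, is to check for the next <hr> with a lookahead so that it is not consumed. The following is only a minimal sketch under that assumption; the class and method names (SectionRegexDemo, ExtractSections) are hypothetical, and it will still break on malformed or nested markup.

```csharp
using System.Text.RegularExpressions;

static class SectionRegexDemo
{
    // Hypothetical helper: returns { caption, content } pairs, one per <h2> section.
    public static string[][] ExtractSections(string html)
    {
        // (?=...) only checks that the next <hr .../> (or the end of the input)
        // follows; it does not consume it, so the next match is not skipped.
        Regex pattern = new Regex(
            @"<h2[^>]*>(.*?)</h2>(.*?)(?=<hr[^>]*/>|\z)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        MatchCollection matches = pattern.Matches(html);
        string[][] result = new string[matches.Count][];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = new[]
            {
                matches[i].Groups[1].Value.Trim(), // caption text inside <h2>
                matches[i].Groups[2].Value.Trim()  // raw HTML up to the next <hr> or end
            };
        }
        return result;
    }
}
```

Calling SectionRegexDemo.ExtractSections(html) on the sample above should yield one entry per <h2>, whatever tags the content uses (<p>, <div>, <ul>), because the content group simply captures everything up to the next <hr>.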
Edit:
As some of you brought up HTMLAgilityPack, I got curious about this nice tool. I managed to get the content of the <h1> tag.
But my problem is parsing the rest, because the tags for the content may vary, from <p> to <div> and <ul>. At the moment this seems to come down to more or less iterating over the whole document and parsing it tag by tag. Any hints?
You really need an HTML parser.
Don't use regex to parse HTML. Consider using the HTML Agility Pack.
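For the layout in the question (an <h2> caption followed by content whose tags vary, with <hr> as separator), one way to use the HTML Agility Pack is to select every <h2> and then walk its following siblings until the next <hr> or <h2>. This is a rough sketch, not a definitive implementation: it assumes the HtmlAgilityPack assembly is referenced and that captions and content are siblings as in the sample, and the names SectionWalker and ExtractSections are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party library, assumed to be referenced

static class SectionWalker
{
    // Hypothetical helper: pairs each <h2> caption with the HTML of the
    // sibling elements that follow it, stopping at the next <hr> or <h2>.
    public static List<KeyValuePair<string, string>> ExtractSections(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        HtmlNodeCollection captions = doc.DocumentNode.SelectNodes("//h2");
        if (captions == null)
            return sections; // SelectNodes returns null when nothing matches

        foreach (HtmlNode caption in captions)
        {
            var content = new StringBuilder();
            for (HtmlNode node = caption.NextSibling; node != null; node = node.NextSibling)
            {
                if (node.Name == "hr" || node.Name == "h2")
                    break; // the next separator or caption ends this section
                if (node.NodeType == HtmlNodeType.Element)
                    content.Append(node.OuterHtml); // <p>, <div>, <ul>, ... all handled alike
            }
            sections.Add(new KeyValuePair<string, string>(
                caption.InnerText.Trim(), content.ToString()));
        }
        return sections;
    }
}
```

Because the loop never looks at the tag name of the content nodes, it does not matter whether a section uses <p>, <div> or <ul>; only the separators are checked.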
There are some possibilities:
Regex - Fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - Good, but it has many memory leaks. If you only need to process a few files, that is not a problem.
SGMLReader - Really good, but there is one problem: sometimes it can't find the default namespace needed to get the other nodes, and then the HTML cannot be parsed.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - Good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
Example for SGMLReader (VB.NET):
' Requires a reference to SgmlReader and "Imports System.Xml.Linq".
' vSource is assumed to be a String holding the HTML source.
Dim sgmlReader As New Sgml.SgmlReader()
Dim htmldoc As XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = XDocument.Load(sgmlReader)

Dim XNS As XNamespace
' This part can be buggy: sometimes the default namespace cannot be retrieved.
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use the namespace with the LINQ to XML queries.
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next
Majestic-12 works differently: you have to walk to every tag with a "Next" command. You can find example code shipped with the DLL.
As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a CSS selector engine that works on top of the HtmlAgilityPack, called Fizzler: http://code.google.com/p/fizzler/ Using this you could find all <p> tags using:
// Needs "using System.Linq;" and "using Fizzler.Systems.HtmlAgilityPack;"
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!