
How do I parse HTML using regular expressions in C#?

How do I parse HTML using regular expressions in C#?

For example, given this HTML code:

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain:

1.  <s2>
2.  t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML like XHTML, so I cannot use an XML parser to do this.

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
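As a rough illustration of the XmlReader route, here is a minimal sketch. It assumes the input really is well-formed XML with a single root element, so the sample string below is a repaired, wrapped variant of the question's HTML (the `<div>` wrapper and the closing `</span>` are my additions, not part of the original input). The class and method names are hypothetical, and attributes are left out of the printed tags to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderDemo
{
    // Walks a well-formed XML string and returns start tags, text and end tags
    // in document order. Attributes are not reproduced in the output.
    public static List<string> Tokenize(string xhtml)
    {
        var tokens = new List<string>();
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Self-closing elements such as <img /> never produce an EndElement node.
                        tokens.Add(reader.IsEmptyElement
                            ? "<" + reader.Name + "/>"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value.Trim());
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return tokens;
    }

    public static void Main()
    {
        // Assumed well-formed variant of the question's input.
        string xhtml = "<div><s2> t1 </s2><img src='1.gif' /><span> span1 </span></div>";
        foreach (string token in Tokenize(xhtml))
        {
            Console.WriteLine(token);
        }
    }
}
```

On the question's original, malformed input XmlReader would throw an XmlException instead, which is why this only helps when well-formedness is guaranteed.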

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, which is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack . It even handles malformed HTML.
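For what it's worth, a minimal sketch of what that could look like, assuming the HtmlAgilityPack package is referenced in the project. The class and method names are made up for illustration. Note that the library builds a tree rather than a flat token stream, so this lists elements and text nodes but has no separate entries for closing tags, and how the stray `<span/>` ends up in the tree depends on the library's recovery rules.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AgilityPackDemo
{
    // Loads (possibly malformed) HTML and lists element names and
    // non-empty text nodes in document order.
    public static List<string> ListNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                result.Add("element: " + node.Name);
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                {
                    result.Add("text: " + text);
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string line in ListNodes(html))
        {
            Console.WriteLine(line);
        }
    }
}
```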

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
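A sketch of how that pattern might be applied with `Regex.Matches` (the wrapper class and method names are hypothetical, and the trimming step is my assumption about how the whitespace between tags should be handled). The second alternative can match an empty string, and the text between tags may be whitespace only, so both are skipped here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizerDemo
{
    // The pattern from the post: either a "<...>" tag or a run of non-"<" text.
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            // Drop empty matches and whitespace-only text runs.
            string token = match.Value.Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

For the sample input this prints the seven items listed in the question. It is still a tokenizer rather than a parser: a `>` in the text after a tag, a `<` inside an attribute value, comments, or script content will all split in the wrong places.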

You might want to simply use string functions. Use < and > as your delimiters for parsing.
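One way this could look, as a minimal sketch using only `IndexOf` and `Substring`. The names are hypothetical, and treating an unterminated `<` as plain text is my assumption rather than anything the answer specifies.

```csharp
using System;
using System.Collections.Generic;

public static class StringTokenizerDemo
{
    // Splits the input into "<...>" tags and the trimmed text between them.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        int pos = 0;

        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            if (open < 0)
            {
                // No more tags: the rest is text.
                AddText(tokens, html.Substring(pos));
                break;
            }

            // Text before the next tag, if any.
            AddText(tokens, html.Substring(pos, open - pos));

            int close = html.IndexOf('>', open);
            if (close < 0)
            {
                // Unterminated tag: keep the remainder as text (an assumption).
                AddText(tokens, html.Substring(open));
                break;
            }

            tokens.Add(html.Substring(open, close - open + 1));
            pos = close + 1;
        }
        return tokens;
    }

    private static void AddText(List<string> tokens, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0)
        {
            tokens.Add(trimmed);
        }
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

Like the regex approach, this does not understand quoted attribute values, comments, or scripts, so a `>` inside an attribute would end the tag early.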
