How do I parse HTML using regular expressions in C#?

Question

How do I parse HTML using regular expressions in C#?

For example, given the following HTML:

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain:

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML like XHTML, so I cannot use an XML parser for this.

Answer 1

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
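A minimal sketch of the XmlReader approach, assuming the input really is well-formed. The class name XhtmlTokenizer and the sample string are made up for illustration; the sample is the question's HTML with the last tag corrected to </span>, since XmlReader would reject the original. Attributes are dropped here to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XhtmlTokenizer
{
    // Lists each element start, text node, and element end in document order.
    public static List<string> Tokenize(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Fragment mode accepts several top-level elements without a single root.
            ConformanceLevel = ConformanceLevel.Fragment,
            IgnoreWhitespace = true
        };

        var tokens = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Attributes are not reproduced in this sketch.
                        tokens.Add(reader.IsEmptyElement
                            ? "<" + reader.Name + "/>"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value.Trim());
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return tokens;
    }

    public static void Main()
    {
        // Well-formed variant of the question's sample (assumption: closing </span>).
        string xhtml = "<s2> t1 </s2> <img src='1.gif' /> <span> span1 </span>";
        foreach (string token in Tokenize(xhtml))
        {
            Console.WriteLine(token);
        }
    }
}
```

On input that is not well-formed, such as the question's original sample, reader.Read() throws an XmlException, which is the limitation the asker describes.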

Answer 2

This has been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, which is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.

Answer 3

You might want to try the Html Agility Pack: http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.

Answer 4

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
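For reference, here is one way the pattern above might be applied with Regex.Matches. The class name RegexTokenizer and the filtering of blank matches are my own additions, not something the answer specifies; the second alternative can match an empty or whitespace-only run, so those matches are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RegexTokenizer
{
    // The pattern from the answer: either a <...> tag or a run of non-'<' text.
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            string token = match.Value.Trim();
            // Skip empty and whitespace-only matches from the second alternative.
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

This splits the question's sample into the seven pieces asked for, but it is only a tokenizer for simple input. A literal < or > inside text, attribute values, comments, or script blocks will be split in the wrong place, which is the weakness the other answers warn about.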

Answer 5

You might want to simply use string functions. Use < and > as your delimiters for parsing.
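A rough sketch of what that could look like with IndexOf and Substring. The class name DelimiterTokenizer is hypothetical, and the rule used here (a tag runs from < through the next >, text runs up to the next <) is one possible reading of the suggestion, not something the answer spells out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class DelimiterTokenizer
{
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int end;
            if (html[pos] == '<')
            {
                // Tag: runs through the next '>' (or to the end if there is none).
                int close = html.IndexOf('>', pos);
                end = close < 0 ? html.Length : close + 1;
            }
            else
            {
                // Text: runs up to the next '<' (or to the end if there is none).
                int open = html.IndexOf('<', pos);
                end = open < 0 ? html.Length : open;
            }

            string token = html.Substring(pos, end - pos).Trim();
            // Drop the whitespace-only gaps between tags.
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

Like the regex, this treats every < and > as markup, so it only suits input where those characters never appear inside text, attribute values, comments, or scripts.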


Answer 1: score 6, posted 2009-10-15 01:57:00
Answer 2: score 4, posted 2009-10-15 02:36:56
Answer 3: score 3, posted 2009-10-15 02:12:52
Answer 4: score 0, accepted, posted 2009-10-15 03:05:06
Answer 5: score -3, posted 2009-10-15 02:33:43
