
Regex to get the tags

I have HTML like this:

<h1> Headhing </h>
<font name="arial">some text</font></br>
some other text

In C#, I want to get the output shown below: simply the font element, from its start tag to its end tag, with the content in between.

<font name="arial">some text</font>

I wouldn't recommend trying this with regex.

I use the HTML Agility Pack to parse HTML and get what I want. It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
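As a rough illustration of that approach, here is a minimal sketch. It assumes the third-party HtmlAgilityPack NuGet package is installed; the class and method names (`FontTagExtractor`, `ExtractArialFontTags`) are made up for this example. It loads the markup from the question and selects the font element with an XPath query.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack", not part of .NET

public static class FontTagExtractor
{
    // Hypothetical helper: returns the outer HTML of every <font name="arial"> element.
    public static string[] ExtractArialFontTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup such as the stray </h>

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//font[@name='arial']");
        if (nodes == null) return Array.Empty<string>();

        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].OuterHtml; // start tag, content, and end tag
        return result;
    }

    public static void Main()
    {
        // The markup from the question, including its mismatched </h> tag.
        string html = "<h1> Headhing </h>\n<font name=\"arial\">some text</font></br>\nsome other text";
        foreach (string tag in ExtractArialFontTags(html))
            Console.WriteLine(tag);
    }
}
```

The null check matters because `SelectNodes` does not return an empty collection when the XPath query finds nothing.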

There's also an HTML parser from Microsoft, MSHTML, but I haven't tried it.

First off, your HTML is wrong. You should close an <h1> with </h1>, not </h>. This one thing is why regex is inappropriate for parsing tags.
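To make that point concrete, here is a small sketch using only System.Text.RegularExpressions. The pattern and the names `MalformedHeadingDemo` and `FindsHeading` are assumptions for illustration, not something from the question. A naive heading pattern silently finds nothing in the markup as posted, because the closing tag it expects never appears.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MalformedHeadingDemo
{
    // Naive pattern that assumes every <h1> is closed by a matching </h1>.
    public static bool FindsHeading(string html)
    {
        return Regex.IsMatch(html, @"<h1>.*?</h1>", RegexOptions.Singleline);
    }

    public static void Main()
    {
        string asPosted = "<h1> Headhing </h>\n<font name=\"arial\">some text</font></br>";
        string corrected = "<h1> Headhing </h1>\n<font name=\"arial\">some text</font></br>";

        Console.WriteLine(FindsHeading(asPosted));   // False: there is no </h1> to match
        Console.WriteLine(FindsHeading(corrected));  // True
    }
}
```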

Second, there are hundreds of questions on SO about parsing HTML with regex. The answer is: don't. Use something like the HTML Agility Pack.

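 // Requires: using System.Text.RegularExpressions; "html" is a string holding the input markup.
 Regex regExfont = new Regex(@"<font name=""arial""[^>]*>.*?</font>");  // lazy .*? stops at the first </font>
 MatchCollection rows = regExfont.Matches(html);  // "string" is a C# keyword, so pass a variable instead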
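If you do go the regex route, the quantifier before `</font>` matters. The sketch below is only an illustration: the input string with two font elements and the names `FontRegexDemo` and `MatchAll` are hypothetical. It compares a greedy `.*` with a lazy `.*?` on the same input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontRegexDemo
{
    // Hypothetical helper: returns the text of every match of pattern in html.
    public static string[] MatchAll(string html, string pattern)
    {
        MatchCollection matches = Regex.Matches(html, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Value;
        return result;
    }

    public static void Main()
    {
        // Hypothetical input with two font elements on one line.
        string html = "<font name=\"arial\">one</font> and <font name=\"arial\">two</font>";

        string greedy = @"<font name=""arial""[^>]*>.*</font>";
        string lazy = @"<font name=""arial""[^>]*>.*?</font>";

        Console.WriteLine(MatchAll(html, greedy).Length); // 1: one match spanning both elements
        Console.WriteLine(MatchAll(html, lazy).Length);   // 2: one match per element
    }
}
```

Neither pattern handles nested tags, attributes in a different order, or content that spans several lines, which is part of why a parser is the safer choice.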

A good website for testing regular expressions is http://www.regexlib.com/RETester.aspx
