
C# regular expressions with HTML strings

I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is properly obtaining strings enclosed within HTML tags.

For instance:

I have a string:

<p>&lt;Placeholder&gt;</p>

I've been able to obtain the contents with the following regex:

private string Unescape(){
    string s = WebUtility.HtmlDecode("<p>&lt;Placeholder&gt;</p>");
    string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}

Which would return:

<Placeholder>

However, should the string contain an additional HTML tag, e.g.:

<p><strong>Placeholder</strong></p>

I would get this:

<strong>Placeholder 

It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?

EDIT:

To summarize, is there a way for me to treat the string enclosed within HTML tags as literal? This is to cover the possibility that the string could contain special characters (e.g. > <).

I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:

The alternatives in an alternation are tried from left to right, and the engine takes the first one that succeeds without checking whether a later alternative would match more. So when you search at the start for

^<.*?>|^<.*?><.*?>

on the string

<p><strong>Placeholder</strong></p>

the first alternative matches <p>, so the match ends there and the second alternative is never tried. If you want to match <p><strong> at the start, put the longer alternative first. This only matters for the pattern at the start of the string. At the end of the string your ordering is fine: the $ anchor forces the lazy .*? to keep extending until the final >, so the first alternative already matches </strong></p>.

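So for your second example this would work:

// Requires: using System.Net; using System.Text.RegularExpressions;
private string Unescape(){
    string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
    // Longer alternative first, so both opening tags are removed.
    string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}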

==> The ordering inside an alternation can be important
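
To see the effect of the ordering in isolation, here is a small sketch that only reports what each pattern matches on the second example string. The names AlternationOrderDemo and FirstMatch are made up for illustration; Regex.Match is the standard System.Text.RegularExpressions call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Illustrative helper (not from the original post): returns the text
    // matched by `pattern` in `input`, or "" when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string html = "<p><strong>Placeholder</strong></p>";

        // Shorter alternative first: the engine accepts "<p>" and never
        // tries the second alternative.
        Console.WriteLine(FirstMatch(html, "^<.*?>|^<.*?><.*?>"));    // <p>

        // Longer alternative first: both opening tags are consumed.
        Console.WriteLine(FirstMatch(html, "^<.*?><.*?>|^<.*?>"));    // <p><strong>

        // At the end, "$" forces the lazy ".*?" to stretch from "</strong"
        // up to the final ">", so the first alternative is already enough.
        Console.WriteLine(FirstMatch(html, "</.*?>$|</.*?></.*?>$")); // </strong></p>
    }
}
```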

An alternative would be to use a quantifier instead of an alternation:

string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
return Regex.Replace(dec, "(?:</.*?>)+$", "");

This would also work for more than 2 tags.
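
One caveat, relating to the EDIT in the question: if the entities are decoded before the tags are stripped, a decoded <Placeholder> is indistinguishable from a real tag, and the patterns above would remove it as well. A possible sketch, assuming the input is always one piece of text wrapped in zero or more tags, strips the tags first and decodes last. TagStripper and InnerText are hypothetical names, not part of the original post.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes the run of tags at the start and the run
    // of closing tags at the end, then decodes HTML entities in what is left.
    public static string InnerText(string html)
    {
        // One or more tags anchored at the start of the string.
        string withoutOpening = Regex.Replace(html, "^(?:<.*?>)+", "");

        // One or more closing tags anchored at the end of the string.
        string withoutClosing = Regex.Replace(withoutOpening, "(?:</.*?>)+$", "");

        // Decode last, so "&lt;Placeholder&gt;" is never mistaken for a tag.
        return WebUtility.HtmlDecode(withoutClosing);
    }

    public static void Main()
    {
        Console.WriteLine(InnerText("<p>&lt;Placeholder&gt;</p>"));                     // <Placeholder>
        Console.WriteLine(InnerText("<div><p><strong>Placeholder</strong></p></div>")); // Placeholder
    }
}
```

This still treats HTML as plain text. For anything beyond simple wrapped strings, an HTML parser is likely the safer choice.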
