C# regular expressions with HTML strings

Question

I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is correctly obtaining the strings enclosed within HTML tags.

For instance, I have the string:

<p>&lt;Placeholder&gt;</p>

I've been able to obtain the contents with the following regex:

private string Unescape(){
    string s = WebUtility.HtmlDecode("<p>&lt;Placeholder&gt;</p>");
    string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}

Which returns:

<Placeholder>

However, if the string contains an additional HTML tag, e.g.:

<p><strong>Placeholder</strong></p>

I get this:

<strong>Placeholder

It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?

EDIT:

To summarize, is there a way for me to treat the string enclosed within the HTML tags as literal text, to cover the possibility that it contains special characters (e.g. > or <)?

Answer 1

I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:

An alternation tries its alternatives from left to right and takes the first one that matches; it does not go on to look for a longer match. So when you search at the start for

^<.*?>|^<.*?><.*?>

on the string

<p><strong>Placeholder</strong></p>

the first alternative matches <p>, so the regex stops there with a successful match and never tries the second alternative. If you want to match <p><strong> at the start, you should change the ordering in the alternation. This only applies to the part at the start of the string. For the end of the string your ordering is fine, because the $ anchor forces the lazy .*? in the first alternative to keep extending until the match reaches the end of the string.
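The effect of the ordering can be checked directly. The following is a minimal sketch, not part of the original answer; the class name AlternationOrderDemo and the helper MatchAtStart are hypothetical names introduced only for illustration. It prints what each pattern actually matches on the sample string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text the pattern matches, or "" if it does not match.
    public static string MatchAtStart(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        string html = "<p><strong>Placeholder</strong></p>";

        // Short alternative first: it succeeds on <p>, so the longer one is never tried.
        Console.WriteLine(MatchAtStart(html, "^<.*?>|^<.*?><.*?>"));   // <p>

        // Long alternative first: both opening tags are matched.
        Console.WriteLine(MatchAtStart(html, "^<.*?><.*?>|^<.*?>"));   // <p><strong>

        // At the end, $ forces the lazy .*? to extend across both closing tags.
        Console.WriteLine(MatchAtStart("Placeholder</strong></p>",
                                       "</.*?>$|</.*?></.*?>$"));      // </strong></p>
    }
}
```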

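So for your example this would work:

// Requires: using System.Net; using System.Text.RegularExpressions;
private string Unescape(){
    // Input is the two-tag example; the longer alternative is now tried first.
    string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
    string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}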

==> The ordering inside an alternation can be important

An alternative would be to use a quantifier instead of an alternation:

string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
return Regex.Replace(dec, "(?:</.*?>)+$", "");

This would also work for more than two tags.
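As a rough illustration of the quantifier version, here is a minimal sketch; the class TagStripDemo and the helper StripOuterTags are hypothetical names, not from the original answer. It also shows one possible way to handle the case raised in the question's EDIT: if the string is decoded before stripping, a decoded <Placeholder> looks like a tag and is removed too, so this sketch assumes it is acceptable to strip the tags first and call WebUtility.HtmlDecode afterwards.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Hypothetical helper: removes any run of tags at the start and any run of closing tags at the end.
    public static string StripOuterTags(string s)
    {
        string withoutOpening = Regex.Replace(s, "^(?:<.*?>)+", "");
        return Regex.Replace(withoutOpening, "(?:</.*?>)+$", "");
    }

    public static void Main()
    {
        // Works for three levels of nesting as well.
        Console.WriteLine(StripOuterTags("<div><p><strong>Placeholder</strong></p></div>")); // Placeholder

        // Decoding first turns &lt;Placeholder&gt; into something that looks like a tag,
        // so the quantifier consumes it and nothing is left.
        string decodedFirst = StripOuterTags(WebUtility.HtmlDecode("<p>&lt;Placeholder&gt;</p>"));
        Console.WriteLine(decodedFirst.Length); // 0

        // Assumption: stripping while the content is still encoded, then decoding, keeps it literal.
        string decodedLast = WebUtility.HtmlDecode(StripOuterTags("<p>&lt;Placeholder&gt;</p>"));
        Console.WriteLine(decodedLast); // <Placeholder>
    }
}
```

This only covers content whose special characters are entity-encoded in the source; raw < or > inside the text would still be ambiguous for a regex.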

Answer 1
1 accepted 2012-10-09 07:57:34