C# regular expressions with HTML strings
I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is correctly obtaining strings enclosed within HTML tags.
For instance:
I have a string
<p><Placeholder></p>
I've been able to obtain the contents with the following regex:
private string Unescape(){
string s = WebUtility.HtmlDecode("<p><Placeholder></p>");
string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}
Which would return:
<Placeholder>
However, should the string contain an additional HTML tag, e.g.:
<p><strong>Placeholder</strong></p>
I would get this:
<strong>Placeholder
It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?
EDIT:
To summarize, is there a way for me to treat the string enclosed within HTML tags as literal? This is to cover the possibility that the string could contain special characters (e.g. > and <).
I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:
An alternation tries its alternatives from left to right and uses the first one that matches; it does not go on to check whether a later alternative would match more. So when you search at the start for
^<.*?>|^<.*?><.*?>
on the string
<p><strong>Placeholder</strong></p>
the first alternative matches <p> and the regex ends there with a successful match; the second alternative is never tried. So if you want to match <p><strong> at the start, you should change the ordering in the alternation, but only for the pattern at the start of the string. For the end of the string your ordering is fine: the engine reports the leftmost match, which begins at </strong>, and the lazy .*? then expands as far as it needs to reach the $.
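So for your second example this would work. Be aware that with the decoded <p><Placeholder></p> input the same pattern would also remove <Placeholder>, because it looks like a second opening tag.
private string Unescape(){
    string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
    // Longer alternative first, so that both opening tags are removed
    string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
    return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
}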
==> The ordering inside an alternation can be important
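To see the effect of the ordering in isolation, here is a minimal sketch that only reports what each pattern matches at the start of the string. The class and method names (AlternationOrderDemo, LeadingMatch) are made up for illustration and are not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text the pattern matches,
    // or an empty string if there is no match.
    public static string LeadingMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Short alternative first: the engine stops after the first tag.
    public static string ShortFirst(string input)
    {
        return LeadingMatch(input, "^<.*?>|^<.*?><.*?>");
    }

    // Long alternative first: two tags are taken when two are present,
    // and the single-tag alternative is only a fallback.
    public static string LongFirst(string input)
    {
        return LeadingMatch(input, "^<.*?><.*?>|^<.*?>");
    }
}
```

Called with "<p><strong>Placeholder</strong></p>", ShortFirst returns "<p>" while LongFirst returns "<p><strong>", which is the difference described above.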
An alternative would be to use a quantifier instead of an alternation:
string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
return Regex.Replace(dec, "(?:</.*?>)+$", "");
This would also work for more than two tags.
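Put together as a complete helper, the quantifier version might look like the sketch below. The names TagStripper, StripOuterTags and StripThenDecode are hypothetical. The second method is one possible way to approach the EDIT in the question, and it rests on an assumption the question does not confirm: that the raw input still has its inner content entity-escaped (for example <p>&lt;Placeholder&gt;</p>), so the tags can be stripped before WebUtility.HtmlDecode turns the entities into literal < and >.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes any run of tags at the start of the string and any run of
    // closing tags at the end, using the quantifier patterns from the answer.
    public static string StripOuterTags(string s)
    {
        string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
        return Regex.Replace(dec, "(?:</.*?>)+$", "");
    }

    // Assumption: the inner text is still entity-escaped in the raw input.
    // Stripping first and decoding afterwards keeps escaped angle brackets
    // from being mistaken for tags.
    public static string StripThenDecode(string raw)
    {
        return WebUtility.HtmlDecode(StripOuterTags(raw));
    }
}
```

With "<p><strong>Placeholder</strong></p>", StripOuterTags returns "Placeholder". With an already decoded "<p><Placeholder></p>" it returns an empty string, since <Placeholder> is indistinguishable from a tag at that point; StripThenDecode on "<p>&lt;Placeholder&gt;</p>" returns "<Placeholder>".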