
C# Regex matching

I need to replace some text in C# using RegEx:

string strSText = "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P>" +
    "<LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";

Basically, I need to get rid of the

"<P>"

tag(s) that were introduced inside the sequences

"<UL><P><LI>", 
"</LI><P></P><P><LI>" and
"</LI><P></UL>"

I also need to ignore any whitespace between these tags when performing the removal.

So

"</LI><P></P><P><LI>", "</LI>    <P></P><P><LI>", "</LI><P></P><P>   <LI>" or 
"</LI> <P> </P> <P> <LI>"

must all be replaced with

"</LI><LI>"

I tried using the following RegEx replacements for this purpose:

strSText = Regex.Replace(strSText, "<UL>.*<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*<LI>", "</LI><LI>", 
RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*</UL>", "</LI></UL>", 
RegexOptions.IgnoreCase);

But this performs a "greedy" match and results in:

"<P>Bulleted list</P><UL><LI>Bullet 3</LI></UL>"

I then tried using a "lazy" match:

strSText = Regex.Replace(strSText, "<UL>.*?<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?<LI>", "</LI><LI>", 
RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?</UL>", "</LI></UL>", 
RegexOptions.IgnoreCase);

and this results in:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI></UL>"

But I want the following result, which preserves all other data:

"<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>"

Please help!
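Both attempts lose data for a related reason: `.` matches any character except a newline, so the gap between two tags can swallow whole list items. In the greedy version the first replacement already runs from `<UL>` to the last `<LI>`. In the lazy version the third replacement starts at the first `</LI>` and has to extend to the only `</UL>`, because laziness shortens the end of a match but never moves its start. The following is a minimal sketch that reproduces both results; the class and method names (`GreedyVsLazyDemo`, `Run`) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only: replays the question's
// three replacements with either a greedy or a lazy gap between tags.
public static class GreedyVsLazyDemo
{
    // Sample input from the question, written as one line.
    public const string Input =
        "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P>" +
        "<LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";

    public static string Run(string input, bool lazy)
    {
        // ".*" grabs as much as possible, ".*?" as little as possible,
        // but both happily cross over "<LI>...</LI>" content.
        string gap = lazy ? ".*?" : ".*";
        string s = input;
        s = Regex.Replace(s, "<UL>" + gap + "<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, "</LI>" + gap + "<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
        s = Regex.Replace(s, "</LI>" + gap + "</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
        return s;
    }
}
```

Calling `GreedyVsLazyDemo.Run(GreedyVsLazyDemo.Input, false)` keeps only "Bullet 3", and passing `true` keeps only "Bullet 1", matching the two results quoted in the question. The gap has to be restricted to the `<P>` tags and whitespace that should actually be removed.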

The following regexp matches a run of one or more <P> or </P> tags, each optionally followed by whitespace:

(?:</?P>\s*)+

So if you place that between the other tags you have, you can get rid of the <P> tags, i.e.:

strSText = Regex.Replace(strSText, @"<UL>\s*(?:</?P>\s*)+<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
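As a quick check, the three replacements above can be wrapped in a small helper and run against the sample string and the whitespace variants from the question. This is only a sketch under that assumption; the names `ListCleaner` and `StripStrayParagraphs` are made up for the example and do not come from any library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the three Regex.Replace calls above.
public static class ListCleaner
{
    // A run of <P> or </P> tags, each optionally followed by whitespace.
    private const string Paragraphs = @"\s*(?:</?P>\s*)+";

    public static string StripStrayParagraphs(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase;
        string s = html;
        // Between the list opening and the first item.
        s = Regex.Replace(s, "<UL>" + Paragraphs + "<LI>", "<UL><LI>", opts);
        // Between two consecutive items.
        s = Regex.Replace(s, "</LI>" + Paragraphs + "<LI>", "</LI><LI>", opts);
        // Between the last item and the list closing.
        s = Regex.Replace(s, "</LI>" + Paragraphs + "</UL>", "</LI></UL>", opts);
        return s;
    }
}
```

For the sample string in the question, `ListCleaner.StripStrayParagraphs(strSText)` yields the desired `<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>`, and `"</LI> <P> </P> <P> <LI>"` becomes `"</LI><LI>"`. Because the gap can only consist of `<P>` tags and whitespace, list item content is never swallowed. Note that, with `RegexOptions.IgnoreCase`, lowercase tags are matched too but are rewritten in upper case by the replacement strings.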

Not really an answer to your question, but more of a comment to Jonathon: use HTMLAgilityPack to parse the HTML.
