
Regular Expression - Match But Exclude?

I have a very simple task: find and replace special characters within a string. My regex works, but sometimes the string contains italics tags that I do not want to replace. However, I am required to replace standalone "<" and ">" characters, and that causes the italics tags to be mangled. Is there a way to match the special characters but exclude the italics pattern? Here is my code:

string sampleText = "<i>This should be in italics</i> but this ¶ character needs to be removed"; 
string sPattern = "[―&<>♫♪–‧₢₳-⅓⅟□¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕ×ØÙÚÛÜÝÞßàáãäåæçèéêëìíîïðñòóôö÷øùüýþÿŒœŠšŸŽžƒ˜-‰›¢€°]";
string replacePattern = "";

string text = System.Text.RegularExpressions.Regex.Replace(sampleText, sPattern, replacePattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);

When my program executes I get this back:

iThis should be in italics/i but this character needs to be removed

So is it possible for me to match my special characters but exclude the italics tags? If not, the only solution I can think of is to remove the italics tags with some string processing, run the result through my regex, and then put the italics tags back in.

Any ideas?

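Here's an easy way: match the tags as well, capture them in a group, and write the group back in the replacement. When a tag matches, $1 restores it; when a special character matches, $1 is empty, so the character is removed.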

string sampleText = "<i>This should be in italics</i> but this ¶ character needs to be removed"; 
string sPattern = "(</?i>)|[―&<>♫♪–‧₢₳-⅓⅟□¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕ×ØÙÚÛÜÝÞßàáãäåæçèéêëìíîïðñòóôö÷øùüýþÿŒœŠšŸŽžƒ˜-‰›¢€°]";
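// $1 is the captured tag when the first alternative matched, and empty otherwise.
string replacePattern = "$1";

// Requires: using System.Text.RegularExpressions;
string text = Regex.Replace(sampleText, sPattern, replacePattern, RegexOptions.IgnoreCase);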

Console.WriteLine(text); 
// <i>This should be in italics</i> but this  character needs to be removed

But this will only work with <i> and </i> tags. You can extend it to other tags fairly easily (e.g. "(</?\\w+>)|..." for any simple tag without attributes), but if the markup gets much more complicated than that, I'd recommend parsing the input as XML first and applying the pattern only to the text of the nodes you're interested in.
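The XML route mentioned above could look roughly like the sketch below. It is not code from the answer: the helper name StripSpecialCharsInTextNodes, the synthetic root wrapper, and the shortened character class are all assumptions made for illustration. It also assumes the input is well-formed XML once wrapped, so a raw standalone "<", ">" or "&" in the text would make the parse throw and would have to be handled separately.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

string sampleText = "<i>This should be in italics</i> but this ¶ character needs to be removed";

// Shortened character class for illustration only (assumption);
// substitute the full class from the question.
string sPattern = "[¶♫♪&<>]";

Console.WriteLine(StripSpecialCharsInTextNodes(sampleText, sPattern));

// Hypothetical helper: parses the fragment, applies the pattern to text nodes only,
// and returns the fragment with its tags untouched.
static string StripSpecialCharsInTextNodes(string fragment, string pattern)
{
    // The sample has no single root element, so wrap it in a synthetic one.
    XElement root = XElement.Parse("<root>" + fragment + "</root>", LoadOptions.PreserveWhitespace);

    // Snapshot the text nodes first so the tree is not modified while it is enumerated.
    foreach (XText textNode in root.DescendantNodes().OfType<XText>().ToList())
    {
        textNode.Value = Regex.Replace(textNode.Value, pattern, "");
    }

    // Serialize the children individually, which drops the synthetic root again.
    return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
}
```

Because the pattern never sees the markup, it needs no special case for tags, and tags with attributes are handled the same way as <i>.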

You can use this:

string sPattern = @"(?i)[^<>a-z0-9\s\p{P}]+|<(?!/?i>)|(?<!</?i)>";
string replacePattern = "";

(You can replace \p{P} with the punctuation you want to preserve.)
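As a rough illustration of that substitution, the sketch below swaps \p{P} for an explicit punctuation list. The list itself, the Clean helper, and the extended sample text are assumptions, not part of the answer. Note that "/" has to stay in the list, otherwise the slash in </i> would be stripped by the first alternative.

```csharp
using System;
using System.Text.RegularExpressions;

string sampleText = "<i>This should be in italics</i> but this ¶ character, and 1 < 2 > 0, need cleaning";
Console.WriteLine(Clean(sampleText));

// Hypothetical helper wrapping the lookaround pattern from the answer.
static string Clean(string input)
{
    // Alternative 1: any run of characters that is not <, >, a letter a-z, a digit,
    //                whitespace, or one of the listed punctuation marks (assumed list).
    // Alternative 2: a "<" that does not start <i> or </i>.
    // Alternative 3: a ">" that does not end <i> or </i>.
    string sPattern = @"(?i)[^<>a-z0-9\s.,;:!?'""/-]+|<(?!/?i>)|(?<!</?i)>";
    return Regex.Replace(input, sPattern, "");
}
```

With this list the ¶ and the standalone "<" and ">" are removed, while the italics tags, commas, and surrounding spaces are kept.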


 