How to remove multiple, repeating & unnecessary punctuation from string in C#?

Consider strings like these:

"This is a string....!"
"This is another...!!"
"What is this..!?!?"
...
// There are LOTS of examples of weird/angry sentence-endings like the ones above.

I want to replace the unnecessary punctuation at the end so that they look like this:

"This is a string!"
"This is another!"
"What is this?"

What I basically do is:

- split by space
- check if the last char in the string is a punctuation mark
- start replacing with the patterns below

I have tried a very big chain of ".Replace(string, string)" calls, but it does not work. There has to be a simpler regex, I guess.

Documentation:

Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.

As well as:

Because this method returns the modified string, you can chain together successive calls to the Replace method to perform multiple replacements on the original string.

Something is wrong here.

EDIT: All the proposed solutions work fine! Thank you very much! This one was the best-suited solution for my project:

Regex re = new Regex("[.?!]*(?=[.?!]$)");
string output = re.Replace(input, "");

Your solution almost works; the only issue arises when the same sequence can be matched starting at different spots. For example, ..!?!? from your last line is not part of the substitution list, so ..!? and !? get replaced by two separate matches, producing ?? in the output.
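To make that failure mode concrete, here is a minimal sketch. The question's actual replacement list is not shown, so the two-entry chain below is a hypothetical stand-in, and the names ChainedReplaceDemo and Collapse are made up for illustration:

```csharp
using System;

static class ChainedReplaceDemo
{
    // Hypothetical two-step chain; the real list in the question is unknown.
    public static string Collapse(string s) =>
        s.Replace("..!?", "?")   // "What is this..!?!?" becomes "What is this?!?"
         .Replace("!?", "?");    // the leftover "!?" becomes "?", leaving "??"

    static void Main()
    {
        Console.WriteLine(Collapse("What is this..!?!?")); // prints: What is this??
    }
}
```

Each Replace call only sees the literal sequence it was given, so a longer run that is not in the list gets split across several entries.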

It looks like your strategy is pretty straightforward: in a chain of multiple punctuation characters, the last character wins. You can use a regular expression to do the replacement:

[!?.]*([!?.])

and replace it with $1, i.e. the capturing group that holds the last character:

string s;
while ((s = Console.ReadLine()) != null) {
    s = Regex.Replace(s, "[!?.]*([!?.])", "$1");
    Console.WriteLine(s);
}

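As a rough check of the "last character wins" pattern, the sketch below runs it over two of the question's sample strings plus one extra input of my own ("Wait... what?!"). The wrapper names LastCharWinsDemo and Collapse are assumptions for illustration only:

```csharp
using System;
using System.Text.RegularExpressions;

static class LastCharWinsDemo
{
    // Collapse every run of '.', '?' or '!' to the run's final character.
    public static string Collapse(string s) =>
        Regex.Replace(s, "[!?.]*([!?.])", "$1");

    static void Main()
    {
        Console.WriteLine(Collapse("This is a string....!")); // This is a string!
        Console.WriteLine(Collapse("What is this..!?!?"));    // What is this?
        // Runs in the middle of the string are collapsed as well:
        Console.WriteLine(Collapse("Wait... what?!"));        // Wait. what!
    }
}
```

Note that this pattern is not anchored, so an ellipsis inside a sentence is shortened too. Whether that is desirable depends on the input.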

Simply

[.?!]*(?=[.?!]$)

should do it for you. For example:

Regex re = new Regex("[.?!]*(?=[.?!]$)");
Console.WriteLine(re.Replace("This is a string....!", ""));

This replaces all punctuation marks but the last with nothing.

[.?!]* matches any number of consecutive punctuation characters, and (?=[.?!]$) is a positive lookahead that makes sure one is left at the end of the string.

See it here at ideone.
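Because of the $ inside the lookahead, this pattern only touches a run that ends the string. A small sketch of that behaviour follows; the names TrailingOnlyDemo and Collapse and the "Wait... what?!" input are my own additions for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class TrailingOnlyDemo
{
    static readonly Regex Trailing = new Regex("[.?!]*(?=[.?!]$)");

    // Remove every trailing punctuation mark except the very last one.
    public static string Collapse(string s) => Trailing.Replace(s, "");

    static void Main()
    {
        Console.WriteLine(Collapse("This is another...!!")); // This is another!
        // The ellipsis in the middle is left alone; only the ending is cleaned up.
        Console.WriteLine(Collapse("Wait... what?!"));       // Wait... what!
    }
}
```

If the input can carry trailing whitespace, the run no longer sits directly before the end of the string and is not matched, so trimming the input first may be needed.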

Or you can do it without regular expressions:

    string TrimPuncMarks(string str)
    {
        HashSet<char> punctMarks = new HashSet<char>() {'.', '!', '?'};

        int i = str.Length - 1;
        for (; i >= 0; i--)
        {
            if (!punctMarks.Contains(str[i]))
                break;
        }

        // the very last punctuation mark, or null if the string does not end with one
        char? suffix = i < str.Length - 1 ? str[str.Length - 1] : (char?)null;

        return str.Substring(0, i+1) + suffix;
    }

    Debug.Assert("What is this?" == TrimPuncMarks("What is this..!?!?"));
    Debug.Assert("What is this" == TrimPuncMarks("What is this"));
    Debug.Assert("What is this." == TrimPuncMarks("What is this."));
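A shorter variant of the same idea, offered only as a sketch, can lean on String.TrimEnd to strip the trailing run and then re-append the original last character. The names TrimEndDemo and Collapse are hypothetical:

```csharp
using System;

static class TrimEndDemo
{
    static readonly char[] PunctMarks = { '.', '!', '?' };

    public static string Collapse(string s)
    {
        // Strip the whole trailing run of punctuation marks.
        string trimmed = s.TrimEnd(PunctMarks);

        // Nothing was stripped, so the string did not end with punctuation.
        if (trimmed.Length == s.Length)
            return s;

        // Put back only the last character of the original run.
        return trimmed + s[s.Length - 1];
    }

    static void Main()
    {
        Console.WriteLine(Collapse("What is this..!?!?")); // What is this?
        Console.WriteLine(Collapse("What is this"));       // What is this
    }
}
```

This should behave the same as the loop above for the three asserted inputs, including the case with no trailing punctuation.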
