
Building a regex, how to remove redundant line breaks?

I have a string like this

"a           a            a  a aaa b c d e f a g a aaa  aa           a       a"

I want to turn it into either

"a b c d e f a g a"

or

"a                        b c d e f a g a                   "

(whichever's easier, it doesn't matter since it'll be HTML)

The "a"s are line breaks ( \r\n ), in case that changes anything.

Generally your code should be:

s.replace(new RegExp("(\\S)(?:\\s*\\1)+","g"), "$1"); 
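As a quick check, here is a minimal sketch that applies the same pattern to the sample string from the question. The function name collapseRepeats is made up for illustration; the regex literal is assumed to be equivalent to the RegExp string above.

```javascript
// Collapse runs of the same non-whitespace character,
// where the copies may be separated by any amount of whitespace.
function collapseRepeats(s) {
  return s.replace(/(\S)(?:\s*\1)+/g, "$1");
}

const sample =
  "a           a            a  a aaa b c d e f a g a aaa  aa           a       a";

console.log(collapseRepeats(sample)); // "a b c d e f a g a"
```

Note that the pattern is not specific to "a": it collapses any repeated non-whitespace character, so ordinary text such as "hello" would become "helo".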


However, depending on what the characters a, b, c, ... represent in your case, you might need to change \S to another class, such as [^ ], and \s to [ ], if you want \r and \n to be collapsed as well:

s.replace(new RegExp("([^ ])(?:[ ]*\\1)+","g"), "$1");


However, if a represents the string \r\n, you need a slightly more complicated pattern:

s.replace(new RegExp("(\\r\\n|\\S)(?:[^\\S\\r\\n]*\\1)+","g"), "$1");
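A small sketch of how this variant behaves on real line breaks. The input string and the name collapseLineBreaks are invented for illustration; JSON.stringify is used only to make the \r\n sequences visible in the output.

```javascript
// Treat "\r\n" as a single unit that can repeat; copies may be separated
// only by whitespace other than CR and LF (spaces, tabs, ...).
function collapseLineBreaks(s) {
  return s.replace(/(\r\n|\S)(?:[^\S\r\n]*\1)+/g, "$1");
}

const text = "\r\n   \r\n \r\n\r\nb c\r\n  \r\n";

console.log(JSON.stringify(collapseLineBreaks(text))); // "\r\nb c\r\n"
```

Because the pattern still includes \S as an alternative, repeated ordinary characters are collapsed too, which may not be wanted on real text.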


If I understand the problem correctly, the goal is to remove duplicate copies of a specific character or string, possibly separated by whitespace. You can do that by replacing matches of the regular expression (a\s*)+ with "a ": a\s* matches an a followed by optional whitespace, and + matches one or more consecutive copies. How precisely you do that depends on the language: in Perl it's $str =~ s/(a\s*)+/a /g , in Ruby it's str.gsub(/(a\s*)+/, "a ") , and so on.

The fact that a is actually \r\n shouldn't complicate things, but it might mean that the replacement would work better as s/(\r\n[ \t]*)+/\r\n/g (since \s overlaps with \r and \n ).
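For comparison with the Perl and Ruby one-liners, the same substitution can be sketched in JavaScript. The function name squeezeCrLf and the input are hypothetical.

```javascript
// One or more "\r\n", each optionally followed by spaces or tabs,
// becomes a single "\r\n".
function squeezeCrLf(s) {
  return s.replace(/(?:\r\n[ \t]*)+/g, "\r\n");
}

const input = "x\r\n \t\r\n\r\n  y";

console.log(JSON.stringify(squeezeCrLf(input))); // "x\r\ny"
```

Using [ \t] instead of \s keeps the whitespace part from swallowing the line breaks themselves. As the example shows, indentation directly after the last line break is removed as well.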

If you need C# code and you want to collapse just the \r\n sequences, together with any leading and trailing whitespace, the solution is pretty simple:

string result = Regex.Replace(input, @"\s*\r\n\s*", "\r\n");
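The same idea can be tried in JavaScript. This is only a rough equivalent under the assumption that \s behaves closely enough in both engines for this input; trimAroundCrLf is a made-up name.

```javascript
// Replace a "\r\n" plus all whitespace around it with a single "\r\n".
// Since \s also matches CR and LF, neighbouring line breaks are absorbed too.
function trimAroundCrLf(s) {
  return s.replace(/\s*\r\n\s*/g, "\r\n");
}

const raw = "a  \r\n \r\n\r\n  b";

console.log(JSON.stringify(trimAroundCrLf(raw))); // "a\r\nb"
```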


I went with this:

private string GetDescriptionFor(HtmlDocument document)
{
    string description = CrawlUsingMetadata(XPath.ResourceDescription, document);

    // Three or more consecutive line breaks (\r\n or \n), each optionally followed by spaces.
    // Multiline and IgnoreCase had no effect on this pattern, so they are omitted.
    Regex regex = new Regex(@"(?:\r?\n[ ]*){3,}");

    // Collapse each such run into a single blank line, then decode HTML entities.
    string result = regex.Replace(description, "\n\n");
    return HttpUtility.HtmlDecode(result);
}

As intended, it leaves line breaks alone unless three or more of them occur in a row (ignoring spaces between them), and it replaces each such run with \n\n .
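To see that behaviour in isolation, here is a minimal JavaScript sketch of just the regex step. It leaves out the HTML crawling and decoding, and the name limitBlankLines and the sample text are invented.

```javascript
// Runs of three or more line breaks ("\r\n" or "\n"), each optionally
// followed by spaces, are reduced to exactly two "\n".
function limitBlankLines(s) {
  return s.replace(/(?:\r?\n[ ]*){3,}/g, "\n\n");
}

const description = "p1\r\n\r\n  \r\n\r\np2\nline\n\np3";

console.log(JSON.stringify(limitBlankLines(description)));
// "p1\n\np2\nline\n\np3"
```

Single and double line breaks pass through unchanged; only the run of four after "p1" is shortened.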

Try this:

Regex.Replace(inputString, @"(\r\n\s+)", " ");

