How do I shorten this fragment of regex code?

I have the following function in Java. It takes a String comment and returns it with all trailing # characters removed, every # character that is followed by a space removed, and every run of adjacent # characters, such as ###, replaced with a single # character. This is the code segment:

private static String replaceHashTagsAndPunctuation(String comment) {
    // Remove trailing '#' values
    comment = comment.replaceAll("#*$", "");

    // Replace instances of multiple '#' values with one '#'
    comment = comment.replaceAll("#+", "#");

    // Remove punctuation
    comment = comment.replaceAll("[^a-zA-Z0-9 #]", "");

    // Remove all hashtags that have no word after them
    comment = comment.replaceAll("# ", "");

    return comment;
}

This is incredibly verbose and ugly. So my question is:
How can I rewrite this using better regex statements to remove all of these parts of the String in one or two lines?
Also, an explanation for why the regex code you suggested works would help me get a better understanding of how regular expressions work in Java.

Idea 1

How about this to replace two of your calls:

comment = comment.replaceAll("#+([ #])", "$1");

It works slightly differently from the two calls below, because it leaves the trailing space in place:

comment = comment.replaceAll("#+", "#");
comment = comment.replaceAll("# ", "");

I don't know if the trailing space is important to remove because your words said "remove any # with a space after them" but didn't say to remove the space. However, the code does remove it.
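
To make the difference concrete, here is a minimal sketch that runs the single call and the two original calls on the same input. The class name IdeaOneDemo, the method names, and the sample string are made up for illustration.

```java
final class IdeaOneDemo {
    // Idea 1: a run of '#' followed by a space or '#' is replaced by that following character.
    static String oneCall(String comment) {
        return comment.replaceAll("#+([ #])", "$1");
    }

    // The two calls from the question that Idea 1 is meant to replace.
    static String twoCalls(String comment) {
        return comment.replaceAll("#+", "#").replaceAll("# ", "");
    }

    public static void main(String[] args) {
        String input = "a ## b #tag";
        // Brackets make the spaces visible.
        System.out.println("[" + oneCall(input) + "]");   // [a  b #tag]  (space kept, so two spaces)
        System.out.println("[" + twoCalls(input) + "]");  // [a b #tag]   (space removed with the '#')
    }
}
```

If the extra space matters, the two-call version is the one that matches the original behaviour.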

Idea 2

It adds some complexity but you could take care of three of them with:

comment = comment.replaceAll("#+([ #]|$)", "$1");

Explaining

The "$1" in the second parameter means the matched text is replaced with whatever part of it matched the group inside the parentheses. Note that it has to be a quoted String.

The [ #] means either a space or a number sign.

Together, [ #]|$ means a space, a number sign, or the end of the string.
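
The three cases can be checked with a small sketch like the one below. The class name HashRunDemo, the method name, and the inputs are assumptions chosen for illustration, not part of the question.

```java
import java.util.regex.Pattern;

final class HashRunDemo {
    // A run of '#', then a captured space, a captured '#', or the end of the input.
    private static final Pattern HASH_RUN = Pattern.compile("#+([ #]|$)");

    static String strip(String comment) {
        // "$1" puts back only what group 1 captured.
        return HASH_RUN.matcher(comment).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Case 1: run followed by a space; the space is captured and kept.
        System.out.println("[" + strip("a ### b") + "]");  // [a  b]
        // Case 2: run followed by another '#'; backtracking leaves one '#' in the group.
        System.out.println("[" + strip("##tag") + "]");    // [#tag]
        // Case 3: run at the end of the input; the group is empty, so the run disappears.
        System.out.println("[" + strip("done###") + "]");  // [done]
    }
}
```

In case 2 the greedy #+ first grabs both characters, fails to find a space, a # or the end after them, and gives one # back so the group can match it.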

There is nothing wrong with the code per se, but it can be refactored.

For instance:

// LinkedHashMap: insertion order matters!
private static final Map<Pattern, String> REPLACEMENTS
    = new LinkedHashMap<Pattern, String>();

static {
    Pattern pattern;
    String replacement;

    pattern = Pattern.compile("#*$");
    replacement = "";
    REPLACEMENTS.put(pattern, replacement);

    pattern = Pattern.compile("#+");
    replacement = "#";
    REPLACEMENTS.put(pattern, replacement);

    // etc
}

Then your code could be:

private static String replaceHashTagsAndPunctuation(final String comment)
{
    String ret = comment;

    for (final Map.Entry<Pattern, String> entry: REPLACEMENTS.entrySet())
        ret = entry.getKey().matcher(ret).replaceAll(entry.getValue());

    return ret;
}
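
Put together, with the "// etc" filled in using the other two patterns from the question, a self-contained version might look like the sketch below. The class name ReplacementTable, the method name clean, and the sample input are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class ReplacementTable {
    // LinkedHashMap keeps insertion order, so the patterns run in the order they were added.
    private static final Map<Pattern, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        REPLACEMENTS.put(Pattern.compile("#*$"), "");             // trailing '#'
        REPLACEMENTS.put(Pattern.compile("#+"), "#");             // runs of '#'
        REPLACEMENTS.put(Pattern.compile("[^a-zA-Z0-9 #]"), "");  // punctuation
        REPLACEMENTS.put(Pattern.compile("# "), "");              // '#' with no word after it
    }

    static String clean(final String comment) {
        String ret = comment;
        for (final Map.Entry<Pattern, String> entry : REPLACEMENTS.entrySet()) {
            ret = entry.getKey().matcher(ret).replaceAll(entry.getValue());
        }
        return ret;
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello, world! ### #java##"));  // Hello world #java
    }
}
```

The patterns are compiled once instead of on every call, which is the main practical gain over calling String.replaceAll repeatedly.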

You can do all of the removals in a single call:

comment = comment.replaceAll("#+", "#").replaceAll("[^a-zA-Z0-9 #]|# |#*$", "");

The vertical bar (|) in a regex means OR.
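
One thing worth checking before adopting this: the alternation runs in a single pass, so it does not appear to be strictly equivalent to the four sequential calls. The sketch below compares the two on an ordinary input and on an edge case where punctuation sits between a # and a space. The class name, method names, and inputs are made up for illustration.

```java
final class SinglePassDemo {
    // The four sequential calls from the question.
    static String original(String comment) {
        comment = comment.replaceAll("#*$", "");
        comment = comment.replaceAll("#+", "#");
        comment = comment.replaceAll("[^a-zA-Z0-9 #]", "");
        comment = comment.replaceAll("# ", "");
        return comment;
    }

    // The two chained calls from this answer.
    static String singlePass(String comment) {
        return comment.replaceAll("#+", "#").replaceAll("[^a-zA-Z0-9 #]|# |#*$", "");
    }

    public static void main(String[] args) {
        String typical = "Hello, world! ### #java##";
        System.out.println(original(typical));    // Hello world #java
        System.out.println(singlePass(typical));  // Hello world #java

        // Edge case: the '!' sits between '#' and the space, so "# " only
        // appears after the punctuation has already been removed.
        String edge = "a #! b";
        System.out.println(original(edge));       // a b
        System.out.println(singlePass(edge));     // a # b
    }
}
```

For comments where that edge case cannot occur, the shorter form gives the same result.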

Well, to begin with, I think your starting regexes are clear and understandable and solid, which are rare and valuable features in regular expressions, so if I saw this in code I was working on I would not change it. Lee's one-liner:

comment = comment.replaceAll("#+([ #]|$)", "$1");

is compact and correct and clever, but hard to understand completely at first glance. While I consider myself a wiz at regex, I still have to stop and think and unpack the 3 cases encoded in the regex to figure out what it is going to do.

If you want to pretty your code up without going to such extremes, I would recommend:

// Replace instances of 1 or more consecutive '#' values with a single '#'
comment = comment.replaceAll("#{1,}", "#");  // 1

// Strip out '#' followed by space or at end of input
comment = comment.replaceAll("#( |$)", "");  // 2

  1. Replaces 1 or more "#" with a single "#". The quantifier {1,} means the same as +.
  2. Deletes "#" followed by a space or at the end of the input. This deletes a single trailing space after a "#", too. To preserve the space, change the replacement to "$1".
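
As a rough check of the two variants described in point 2, the sketch below runs both replacements on one input. The class name TwoStepDemo, the method names, and the sample string are assumptions for illustration.

```java
final class TwoStepDemo {
    // Step 2 with an empty replacement: the space after '#' is removed too.
    static String dropSpace(String comment) {
        return comment.replaceAll("#{1,}", "#").replaceAll("#( |$)", "");
    }

    // Step 2 with "$1": the captured space (or nothing, at the end) is put back.
    static String keepSpace(String comment) {
        return comment.replaceAll("#{1,}", "#").replaceAll("#( |$)", "$1");
    }

    public static void main(String[] args) {
        String input = "a ### b ##";
        // Brackets make the spaces visible.
        System.out.println("[" + dropSpace(input) + "]");  // [a b ]
        System.out.println("[" + keepSpace(input) + "]");  // [a  b ]
    }
}
```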
