Regex: Want to change case of letter following one of a set, except HTML entity

Question

Examples:

rythm&blues                   -> Rythm&Blues  
.. DON&apos;T WEAR WHITE/LIVE -> Don&apos;t Wear White/Live

First I convert the whole string to lowercase (because I want to have only Uppercase at the start of a word).

I currently do this by splitting on the pattern [&/\\.\\s-] and then converting the first letter of each part to uppercase.

It works well, except that it also converts HTML entities, of course: e.g. don&apos;t is converted to don&Apos;t, but that entity should be left alone.

While writing this I discovered an additional problem: the initial conversion to lowercase potentially messes up some HTML entities as well. So the entities should be left alone entirely. (E.g. &Ccedil; is not the same as &ccedil;.)

An HTML entity is probably matched like this: &[a-zA-Z][a-z]{1,5};
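A quick way to sanity-check that guess is to run it against a few entities. The sketch below assumes the pattern is meant to read `&[a-zA-Z][a-z]{1,5};`; the class and method names are made up for illustration. It suggests the pattern covers short named entities but misses numeric ones and longer names, so a looser pattern may be needed.

```java
import java.util.regex.Pattern;

class EntityPatternCheck {
    // Assumed reading of the pattern from the question
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&[a-zA-Z][a-z]{1,5};");

    // Hypothetical helper: true if the whole string matches the pattern
    static boolean isNamedEntity(String s) {
        return NAMED_ENTITY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"&apos;", "&Ccedil;", "&#39;", "&thetasym;"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isNamedEntity(sample));
        }
    }
}
```

Here `&apos;` and `&Ccedil;` match, while the numeric entity `&#39;` and the longer name `&thetasym;` do not.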

I am thinking of doing something with groups, but unfortunately I find it very hard to figure out.

Answer 1

This pattern seems to handle your situation:

"\\w+|&#?\\w+;\\w*"

There may be some edge cases, but we can adjust accordingly as they come up.

Pattern Breakdown:

\\w+ - Match any word
&#?\\w+;\\w* - Match an HTML entity (named or numeric), plus any word characters directly after the semicolon
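To see how the two alternatives divide up the input, it can help to list the raw matches before any case conversion. This is only an illustrative sketch; `TokenDemo` and `tokens` are hypothetical names, not part of the answer's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // Same pattern as above: a plain word, or an entity plus trailing word characters
    private static final Pattern TOKEN = Pattern.compile("\\w+|&#?\\w+;\\w*");

    // Hypothetical helper: collects every match in the order it is found
    static List<String> tokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(tokens("rythm&blues"));
        // prints [rythm, blues]
        System.out.println(tokens(".. DON&apos;T WEAR WHITE/LIVE"));
        // prints [DON, &apos;T, WEAR, WHITE, LIVE]
    }
}
```

The bare & in rythm&blues is never matched because no semicolon follows it, so it passes through untouched, while &apos;T comes back as a single token that can be handled separately.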

Code Sample:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCase {
    public static void main(String[] args) {
        String[] lines = {
            "rythm&blues",
            ".. DON&apos;T WEAR WHITE/LIVE"
        };

        Pattern pattern = Pattern.compile("\\w+|&#?\\w+;\\w*");
        for (int i = 0; i < lines.length; i++) {
            Matcher matcher = pattern.matcher(lines[i]);
            // Build the result match by match, so that each replacement
            // only touches the text that was actually matched
            StringBuffer result = new StringBuffer();
            while (matcher.find()) {
                String match = matcher.group();
                String replacement;
                if (match.startsWith("&")) {
                    // Handle HTML entities: keep the entity, including its
                    // semicolon, as-is and lowercase any letters after it
                    int semicolonIndex = match.indexOf(';');
                    replacement = match.substring(0, semicolonIndex + 1)
                            + match.substring(semicolonIndex + 1).toLowerCase();
                } else {
                    // Uppercase the first letter of the word and lowercase
                    // the rest of the word
                    replacement = Character.toUpperCase(match.charAt(0))
                            + match.substring(1).toLowerCase();
                }
                matcher.appendReplacement(result,
                        Matcher.quoteReplacement(replacement));
            }
            matcher.appendTail(result);
            lines[i] = result.toString();
        }

        System.out.println(Arrays.toString(lines));
    }
}

Results:

[Rythm&Blues, .. Don&apos;t Wear White/Live]

Answer 2

The solution here is probably a lookahead assertion: the split should match the & character only if it is not the start of an entity. The problem is that I am not sure whether your data can contain text that could be mistaken for an entity (basically anything ending with ;). But let's assume for now that it does not. This is how such a split pattern with a negative lookahead could look:

/(?!&apos;)[&/\.\s-]/

Note that this handles only the &apos; entity. You will probably want to extend the list of possible entities, or provide a pattern that matches all valid entities.

Here's a fiddle (JS, but should work in Java as well): http://refiddle.com/refiddles/55a5078c75622d15bb010000
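For reference, a rough Java sketch of the same idea follows. It makes a few assumptions that go beyond the pattern above: the lookahead is generalized to `(?!&#?\\w+;)` so that any entity-like text is skipped, the input is assumed to be lowercase already (apart from entities), and instead of splitting it capitalizes the letter that follows the start of the string or a delimiter. `LookaheadCapitalizer` and `capitalize` are made-up names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCapitalizer {
    // Group 1: start of input, or a delimiter that does not begin an entity.
    // Group 2: the lowercase letter directly after it.
    private static final Pattern WORD_START =
            Pattern.compile("(^|(?!&#?\\w+;)[&/.\\s-])([a-z])");

    // Hypothetical helper: uppercases the letter at each word start
    static String capitalize(String input) {
        Matcher matcher = WORD_START.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement =
                    matcher.group(1) + matcher.group(2).toUpperCase(Locale.ROOT);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("rythm&blues"));
        // prints Rythm&Blues
        System.out.println(capitalize(".. don&apos;t wear white/live"));
        // prints .. Don&apos;t Wear White/Live
    }
}
```

Because the & in &apos; fails the negative lookahead, it is not treated as a delimiter and the entity is left untouched. The sketch does not address lowercasing uppercase input without damaging entities such as &Ccedil;.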

solution1
2 2015-07-14 13:13:51

solution2
0 2015-07-14 12:59:06
