Examples:
rythm&blues -> Rythm&Blues
.. DON&apos;T WEAR WHITE/LIVE -> Don&apos;t Wear White/Live
First I convert the whole string to lowercase (because I want uppercase only at the start of a word).
I currently do this by using a split pattern: [&/\\.\\s-]
And then I convert the first letter of each part to uppercase.
It works well, except that it also converts HTML entities, of course. E.g. don&apos;t is converted to don&Apos;t, but that entity should be left alone.
While writing this I discovered an additional problem: the initial conversion to lowercase potentially messes up some HTML entities as well. So the entities should be left alone entirely. (E.g. &Ccedil; is not the same as &ccedil;.)
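A minimal sketch that reproduces both problems, assuming the approach boils down to lowercasing the input and then uppercasing the first letter after each delimiter. The class and method names (NaiveTitleCase, naiveTitleCase) are hypothetical, and the matcher-based loop is a stand-in for the split-based code, which is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTitleCase {
    // Assumed reconstruction of the naive approach: a letter is uppercased
    // when it sits at the start of the string or right after a delimiter.
    static final Pattern WORD_START = Pattern.compile("(^|[&/.\\s-])(\\p{L})");

    static String naiveTitleCase(String text) {
        Matcher m = WORD_START.matcher(text.toLowerCase());
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + m.group(2).toUpperCase();
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(naiveTitleCase("rythm&blues"));   // Rythm&Blues
        System.out.println(naiveTitleCase("DON&apos;T"));    // Don&Apos;t (entity broken)
        System.out.println(naiveTitleCase("gar&ccedil;on")); // Gar&Ccedil;on (different entity)
    }
}
```

Because & is itself a delimiter, every entity name gets its first letter uppercased, which is exactly the behaviour that needs to be avoided.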
An HTML entity is probably matched like this: &[a-z][A-Z][a-z]{1,5};
I am thinking of doing something with groups, but unfortunately I find it very hard to figure out.
This pattern seems to handle your situation:
"\\w+|&#?\\w+;\\w*"
There may be some edge cases, but we can adjust accordingly as they come up.
Pattern Breakdown:
\\w+ - Match any word
&#?\\w+;\\w* - Match an HTML entity, plus any word characters directly after the semicolon
Code Sample:
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCase {
    public static void main(String[] args) throws Exception {
        String[] lines = {
            "rythm&blues",
            ".. DON&apos;T WEAR WHITE/LIVE"
        };
        Pattern pattern = Pattern.compile("\\w+|&#?\\w+;\\w*");
        for (int i = 0; i < lines.length; i++) {
            Matcher matcher = pattern.matcher(lines[i]);
            // Build the result match by match, so that a short word is never
            // replaced inside a longer one elsewhere in the line
            StringBuffer result = new StringBuffer();
            while (matcher.find()) {
                String token = matcher.group();
                String replacement;
                if (token.startsWith("&")) {
                    // Handle HTML entities: keep everything up to and
                    // including the semicolon untouched, and lowercase
                    // any letters after it
                    int semicolonIndex = token.indexOf(';');
                    replacement = token.substring(0, semicolonIndex + 1)
                            + token.substring(semicolonIndex + 1).toLowerCase();
                } else {
                    // Uppercase the first letter of the word and lowercase
                    // the rest of the word
                    replacement = Character.toUpperCase(token.charAt(0))
                            + token.substring(1).toLowerCase();
                }
                matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
            }
            matcher.appendTail(result);
            lines[i] = result.toString();
        }
        System.out.println(Arrays.toString(lines));
    }
}
Results:
[Rythm&Blues, .. Don&apos;t Wear White/Live]
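To see why the pattern leaves entities alone, it can help to list the tokens it picks out. The following is only an illustrative sketch; the class and method names (TokenDump, tokens) are hypothetical and not part of the code sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDump {
    static final Pattern TOKEN = Pattern.compile("\\w+|&#?\\w+;\\w*");

    // Collect every match of the pattern, in order of appearance.
    static List<String> tokens(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(tokens("rythm&blues"));
        // [rythm, blues]
        System.out.println(tokens(".. DON&apos;T WEAR WHITE/LIVE"));
        // [DON, &apos;T, WEAR, WHITE, LIVE]
    }
}
```

A lone & is never part of a token, so it passes through unchanged, while an entity and the letters glued to it arrive as a single token that can be handled separately.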
The solution here will probably be a lookahead assertion. That means the split should match the & character only if it is not the start of an entity. The problem is that I am not sure whether your data can contain text that could be mistaken for an entity (basically anything after an & that ends with ;). But let's assume for now that it does not. This is how such a split pattern with a negative lookahead could look:
/(?!&apos;)[&/\.\s-]/
Note that this case handles only the &apos; entity. You will probably want to extend the list of possible entities, or provide a pattern that matches all valid entities.
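As a rough Java sketch of that extension, the lookahead below skips anything shaped like &name; or &#123; instead of only &apos;. The generalized entity pattern &#?\w+; is an assumption (it does not check that the name is a valid entity), and the class and method names (EntityAwareSplit, splitWords) are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EntityAwareSplit {
    // A delimiter matches only when the text at that position does not
    // look like the start of an entity.
    static final Pattern SPLIT = Pattern.compile("(?!&#?\\w+;)[&/.\\s-]");

    static String[] splitWords(String text) {
        return SPLIT.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("rythm&blues")));
        // [rythm, blues]
        System.out.println(Arrays.toString(splitWords("don&apos;t wear white/live")));
        // [don&apos;t, wear, white, live]
    }
}
```

Text such as AT&T; would still be mistaken for an entity here, which is the assumption mentioned above.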
Here's a fiddle (JS, but should work in Java as well): http://refiddle.com/refiddles/55a5078c75622d15bb010000