Regex to find all variants of a certain character inside a text

Question

I am trying to find Unicode variants of a user-entered character in a text so that I can highlight them. E.g., if the user enters "Beyonce", I'd like to highlight all variants such as "Beyoncé", "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is to build a regex by replacing each character of the input string with a character class, like this:

"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"

But this seems to be a very tedious and error-prone way of doing it. What I am basically looking for is a regex character class that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is something like this available in Java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.

Answer 1

There are several ways an accented character can be represented. There's a good example in the Javadoc of java.text.Normalizer:

For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or as two separate characters (the "decomposed" form):

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT

The second form makes it relatively easy to access the unaccented base character, and fortunately Normalizer can help you here:

String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD); // NFD = "Canonical decomposition"
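
As a minimal sketch of what the call does, the following program decomposes the A-acute example from the Javadoc quote and checks the result. The class name NfdDemo and the helper decompose are made up for illustration; only Normalizer.normalize and Normalizer.Form.NFD come from the JDK.

```java
import java.text.Normalizer;

class NfdDemo {
    // Hypothetical helper: returns the canonical decomposition (NFD) of the input.
    public static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00C1";              // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = decompose(composed); // "A" followed by COMBINING ACUTE ACCENT

        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2
        System.out.println(decomposed.charAt(0)); // A
    }
}
```

Note that normalize returns a new string; the input is left unchanged, so the result has to be assigned.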

After decomposition the accents are separate non-ASCII characters, so you can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:

[^\p{ASCII}]
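
Putting the two steps together, here is a hedged sketch of how this could be applied to the highlighting problem from the question. The class AccentFolding and the methods stripAccents and variantPattern are hypothetical names, not part of any library. The pattern-building approach (allowing \p{M}* after every base character and matching against NFD-normalized text) is one possible way to extend the answer, not something the answer itself prescribes.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccentFolding {
    // Decompose, then drop every non-ASCII character (this includes the combining marks).
    public static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    // Build a pattern in which each base character of the query may be
    // followed by any number of combining marks. Intended for NFD text.
    public static Pattern variantPattern(String query) {
        String base = stripAccents(query);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < base.length(); i++) {
            regex.append(Pattern.quote(String.valueOf(base.charAt(i))));
            regex.append("\\p{M}*");
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        // Sample text with three accented variants, written with Unicode escapes.
        String text = "Beyonc\u00E9, Bey\u00F4nce and B\u00E8y\u00F6nce";
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);

        Matcher m = variantPattern("Beyonce").matcher(normalized);
        while (m.find()) {
            // Offsets refer to the normalized string, not the original one.
            System.out.println(m.start() + "-" + m.end() + ": " + m.group());
        }
    }
}
```

Two caveats with this sketch: the match offsets are positions in the normalized text, so highlighting the original string would need an extra mapping step, and characters without a canonical decomposition (for example "ø" or "ß") are removed entirely by the [^\p{ASCII}] replacement rather than folded to a base letter.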

solution1
2 ACCEPTED 2011-03-03 11:01:44