How can I search a string in C# using Regex while ignoring accents?
For example, in Notepad++, searching ancient Greek text with the regex [[=α=]] returns α, ἀ, ἁ, ᾶ, ὰ, ά, ᾳ, ...
I know Notepad++ uses the PCRE standard. How can I do this in C#? Is there an equivalent syntax?
Edit:
I've already tried string normalization, but it does not work for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only in the case of combining characters, and ᾶ is a separate (precomposed) character in Unicode!
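The System.String.Normalize method still seems to be the key to solving this problem: FormD splits a precomposed character such as ᾶ into its base letter plus combining marks, which can then be filtered out.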
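To see why the normalization form matters, here is a minimal sketch comparing FormC and FormD for the single character ᾶ. The class NormalizationDemo and the helper CodeUnits are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: lists the UTF-16 code units of a string as U+XXXX.
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string alpha = "\u1FB6"; // ᾶ (GREEK SMALL LETTER ALPHA WITH PERISPOMENI)

        // FormC keeps the precomposed character as a single code unit.
        Console.WriteLine(CodeUnits(alpha.Normalize(NormalizationForm.FormC))); // U+1FB6

        // FormD decomposes it into the base letter α plus a combining mark.
        Console.WriteLine(CodeUnits(alpha.Normalize(NormalizationForm.FormD))); // U+03B1 U+0342
    }
}
```

This is why the attempt with FormC in the question appears to do nothing: the accent only becomes a separate, removable character after decomposition.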
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Globalization;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string rawInput = "ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷ";
        Console.WriteLine(rawInput);

        // Strip the diacritics first, then search the plain text.
        string normalizedInput = Utility.RemoveDiacritics(rawInput);
        string pattern = "α+";
        var result = Regex.Matches(normalizedInput, pattern);
        if (result.Count > 0)
            Console.WriteLine(result[0]);
    }
}

public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;

        // FormD decomposes each character into base letter + combining marks;
        // dropping the NonSpacingMark characters leaves only the base letters.
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
    }
}
Output:
ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ
αααααααααααααααααααααααααα
Original Method by Kaplan:
static string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
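Stripping the diacritics means the matches refer to the stripped text, not the original. If the accented characters themselves are needed, one possible variation (a sketch, not part of the original answer) is to keep the combining marks and let the pattern absorb them with \p{Mn}*, which .NET regular expressions support as a Unicode category. The class AccentInsensitiveDemo and the helper FindAlphaVariants are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentInsensitiveDemo
{
    // Hypothetical helper: finds every α together with any combining marks
    // (Unicode category Mn) that follow it in the decomposed text.
    public static MatchCollection FindAlphaVariants(string text)
    {
        // Decompose so that each accent becomes a separate combining character.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        return Regex.Matches(decomposed, @"α\p{Mn}*");
    }

    public static void Main()
    {
        foreach (Match m in FindAlphaVariants("λόγος ᾶ ἀρχή ᾳ"))
        {
            // Recompose each match so it prints as a single accented character.
            Console.WriteLine(m.Value.Normalize(NormalizationForm.FormC));
        }
    }
}
```

Note that the match indexes refer to the decomposed string, so they would need to be mapped back if positions in the original input matter.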
PS: Unfortunately, PCRE.NET, Lucas Trzesniewski's .NET wrapper for the PCRE library, does not support (extended) POSIX collating elements.
There are a few questions that might be able to help which have already been answered -