Regular Expressions in C# for Character Equivalents

Question

How can I search a string in C# using Regex while ignoring accents?

For example, in Notepad++, searching ancient Greek text with the regex [[=α=]] returns α, ἀ, ἁ, ᾶ, ὰ, ά, ᾳ, ...

I understand that Notepad++ uses a Perl-compatible (PCRE-style) regex syntax. How can I do this in C#? Is there an equivalent syntax?

Edit:

I have already tried string normalization, and it does not work for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only for combining characters, whereas ᾶ is a separate, precomposed character in Unicode!

Answer 1

The System.String.Normalize method still seems to be the key to solving this problem: decompose the string with FormD, drop the non-spacing marks, and run the regex on what is left.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Globalization;
using System.Linq;

public class Program
{
    public static void Main()
    {
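        // Polytonic Greek alphas carrying various breathings, accents and iota subscripts.
        string rawInput = "ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ";
        Console.WriteLine(rawInput);

        // Strip the diacritics first, then match the bare letter.
        string normalizedInput = Utility.RemoveDiacritics(rawInput);
        string pattern = "α+";

        var result = Regex.Matches(normalizedInput, pattern);
        if (result.Count > 0)
            Console.WriteLine(result[0]);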
    }
}

public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
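        if (null == str) return null;

        // FormD splits each precomposed letter into its base letter plus combining marks.
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark // drop the combining marks
            select c;

        // Recompose whatever remains.
        return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);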
    }
}

Output:

ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ
αααααααααααααααααααααααααα
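
The reason FormC appeared to do nothing in the question is that it composes characters, so a precomposed ᾶ stays as it is; only FormD exposes the accent as a separate combining mark that can be filtered out. A minimal sketch of the difference (the class name NormalizationDemo and the helper CodePoints are made up for this illustration):

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: renders each UTF-16 code unit as U+XXXX
    // so that composed and decomposed forms can be compared by eye.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    public static void Main()
    {
        string composed = "\u1FB6"; // ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI

        // FormC keeps the precomposed character as a single code point.
        Console.WriteLine(CodePoints(composed.Normalize(NormalizationForm.FormC))); // U+1FB6

        // FormD splits it into the base letter and a combining mark.
        Console.WriteLine(CodePoints(composed.Normalize(NormalizationForm.FormD))); // U+03B1 U+0342
    }
}
```

U+0342 (COMBINING GREEK PERISPOMENI) has the category NonSpacingMark, which is why the filter in RemoveDiacritics discards it.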

Original Method by Kaplan:

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();        
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }       
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
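
If the accents must survive in the matched text (closer to what [[=α=]] does in Notepad++), a variation on the same idea is to leave the input decomposed and let the pattern tolerate combining marks after each base letter. This is only a sketch under that assumption, not an equivalence-class feature of .NET; the names EquivalenceSearch, BuildAccentInsensitive and FindEquivalents are hypothetical, and match positions refer to the FormD string rather than to the original input.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class EquivalenceSearch
{
    // Builds a pattern in which every base character of the search term
    // may be followed by any number of non-spacing marks (\p{Mn}).
    public static Regex BuildAccentInsensitive(string literal)
    {
        var sb = new StringBuilder();
        foreach (char c in literal.Normalize(NormalizationForm.FormD))
        {
            // Ignore accents typed in the search term itself.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{Mn}*");
        }
        return new Regex(sb.ToString(), RegexOptions.CultureInvariant);
    }

    // Matches against the decomposed text and recomposes each hit for display.
    public static string[] FindEquivalents(string text, string literal)
    {
        return BuildAccentInsensitive(literal)
            .Matches(text.Normalize(NormalizationForm.FormD))
            .Cast<Match>()
            .Select(m => m.Value.Normalize(NormalizationForm.FormC))
            .ToArray();
    }

    public static void Main()
    {
        // "ἀρχή ἄλφα", written with escapes to avoid encoding surprises.
        string text = "\u1F00\u03C1\u03C7\u03AE \u1F04\u03BB\u03C6\u03B1";

        string[] hits = FindEquivalents(text, "\u03B1"); // search for α
        Console.WriteLine(hits.Length);                  // 3
        Console.WriteLine(string.Join(", ", hits));      // the hits are ἀ, ἄ and α
    }
}
```

Compared with stripping the diacritics up front, this keeps the accented form available in each match, at the cost of building the pattern programmatically.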

References:

Michael S. Kaplan: FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)
Michael S. Kaplan: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
Code adapted from: How do I remove diacritics (accents) from a string in .NET?

PS: Unfortunately, PCRE.NET, Lucas Trzesniewski's .NET wrapper for the PCRE library, does not support (extended) POSIX collating elements.

Answer 2

A few questions that have already been answered may help:

How do I remove diacritics (accents) from a string in .NET?

Regex accent insensitive?

2 answers

solution1
2 ACCEPTED 2018-04-07 07:00:52

solution2
0 2018-04-06 18:09:31
