
Regular Expressions in C# for Character Equivalents

How can I search a string in C# with Regex while ignoring accents?

For example, in Notepad++, searching ancient Greek text with the regex [[=α=]] returns α, ἀ, ἁ, ᾶ, ὰ, ά, ᾳ, ...

I understand Notepad++ uses a Perl-compatible (PCRE-style) regex engine. How can I do this in C#? Is there an equivalent syntax?

Edit:

I've already tried string normalization, but it does not work for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization only removes accents that are stored as combining characters, and ᾶ is a single precomposed character in Unicode!

The System.String.Normalize method still seems to be the key to solving this problem: decompose the string with FormD, then drop the non-spacing marks.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Globalization;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string rawInput = "ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷ";
        Console.WriteLine(rawInput);
        // Strip the diacritics first so a plain α in the pattern matches every variant.
        string normalizedInput = Utility.RemoveDiacritics(rawInput);
        string pattern = "α+";

        var result = Regex.Matches(normalizedInput, pattern);
        if (result.Count > 0)
            Console.WriteLine(result[0]);
    }
}

public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
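        if (null == str) return null;
        // FormD splits each accented letter into its base letter plus combining marks.
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark // drop the combining marks
            select c;

        // Recompose whatever remains.
        return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);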
    }
}

Output:

ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ
αααααααααααααααααααααααααα
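
To see why FormD succeeds where the FormC attempt in the question did not, it can help to print the code points each form produces. The following is a minimal sketch; the class name NormalizationDemo and the helper CodePoints are made up for illustration and are not part of the answer's code.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    public static void Main()
    {
        // U+1FB6 is GREEK SMALL LETTER ALPHA WITH PERISPOMENI (a precomposed character).
        string precomposed = "\u1FB6";

        // FormC keeps the precomposed character as it is.
        Console.WriteLine(CodePoints(precomposed.Normalize(NormalizationForm.FormC)));
        // prints: U+1FB6

        // FormD splits it into the base letter and a combining mark.
        Console.WriteLine(CodePoints(precomposed.Normalize(NormalizationForm.FormD)));
        // prints: U+03B1 U+0342
    }
}
```

U+0342 (COMBINING GREEK PERISPOMENI) is in the NonSpacingMark category, which is why the filter in RemoveDiacritics leaves only the plain α.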


Original method by Kaplan:

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();        
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }       
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
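
Stripping diacritics also removes them from whatever the regex returns. If the matches should keep their accents, which is closer to what [[=α=]] does in Notepad++, one possible variation is to normalize to FormD only and let the pattern accept trailing combining marks via \p{Mn}*. This is a sketch under that assumption, not a full equivalence-class implementation; EquivalenceSearch and FindEquivalents are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class EquivalenceSearch
{
    // Hypothetical helper: finds every occurrence of baseLetter, with or
    // without diacritics, and returns each hit in composed (FormC) form.
    public static List<string> FindEquivalents(string input, char baseLetter)
    {
        // Decompose so each accented letter becomes base letter + combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);

        // The base letter followed by any number of non-spacing marks.
        string pattern = Regex.Escape(baseLetter.ToString()) + @"\p{Mn}*";

        return Regex.Matches(decomposed, pattern)
            .Cast<Match>()
            .Select(m => m.Value.Normalize(NormalizationForm.FormC))
            .ToList();
    }

    public static void Main()
    {
        // beta, alpha with perispomeni, space, alpha with psili, lambda, phi, plain alpha
        string text = "\u03B2\u1FB6 \u1F00\u03BB\u03C6\u03B1";

        // Yields three hits: U+1FB6, U+1F00 and U+03B1.
        foreach (string hit in FindEquivalents(text, '\u03B1'))
            Console.WriteLine(hit);
    }
}
```

Note that match indexes refer to the decomposed string, so they would need to be mapped back if positions in the original text matter.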


PS: Unfortunately, PCRE.NET, Lucas Trzesniewski's .NET wrapper for the PCRE library, does not support (extended) POSIX collating elements.

A few already-answered questions might also help:

How do I remove diacritics (accents) from a string in .NET?

Regex accent insensitive?
