
How can I find the first strong directionality character of a string in C#?

Assuming I get a string that can have mixed left-to-right and right-to-left content, I want to find the first strong directionality character in it, as defined here.

I think I found a good starting point in this question, but I still can't figure out how the BiDi category is related to the strong directionality characteristic. Is it possible to figure this out in C#?

Instead of relying on the internal implementation, I took a slightly different approach. It leaves room for optimization, but it gives enough of a basis to answer your question.

I simply download UnicodeData.txt, which is part of every official Unicode release. For each Unicode character, that file contains a line with its code point followed by a number of semicolon-delimited fields. A typical line looks like this:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

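The fifth field contains the Bidirectional Class Value. This is the link to strong directionality: a character is of a strong type when its class is L (left-to-right), R (right-to-left), or AL (right-to-left Arabic).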
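As a minimal sketch of reading that field, the snippet below splits one record and checks its class. The `BidiField` class and its method names are hypothetical, introduced here only for illustration.

```csharp
using System;
using System.Globalization;

public static class BidiField
{
    // Splits one UnicodeData.txt record and returns its code point
    // (field index 0, hexadecimal) and its Bidi class (field index 4).
    public static (int CodePoint, string BidiClass) Parse(string line)
    {
        var fields = line.Split(';');
        int codePoint = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return (codePoint, fields[4]);
    }

    // L, R and AL are the strong types; every other class is weak or neutral.
    public static bool IsStrong(string bidiClass)
    {
        return bidiClass == "L" || bidiClass == "R" || bidiClass == "AL";
    }
}
```

For the sample line above, `BidiField.Parse` yields code point 0x41 with class `L`, which `IsStrong` accepts; a digit such as `0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;` has class `EN` and is rejected.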

Armed with this knowledge, a naive parser that reads the data and then inspects a demo string looks like this:

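using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Net;
using System.Text;

// maps each BMP char to its Bidi Class Value
var udb = new Dictionary<char, string>();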

// download UnicodeData txt file
var cli = new WebClient();
var data = cli.DownloadData("http://www.unicode.org/Public/UNIDATA/UnicodeData.txt");
// parse
using (var sr = new StreamReader(new MemoryStream(data), Encoding.UTF8))
{
    var line = sr.ReadLine();
    while (line != null)
    {
        var fields = line.Split(';');

        int uc = int.Parse(fields[0], NumberStyles.HexNumber);
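        // a char holds a single UTF-16 code unit, so only the BMP fits;
        // the file is sorted by code point, so we can stop at the first value above 0xffff
        if (uc > 0xffff) break;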

        var ch = (char) uc;
        var bca = fields[4];

        udb.Add(ch, bca);

        line = sr.ReadLine();
    }
}

// test string
// \u always takes exactly four hex digits; \x is variable-length and easy to get wrong
var s = "123A\ufb1d\u0620";

Console.WriteLine(s);
var pos = 0;
foreach(var c in s)
{

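    // for a char get the Bidi Class Value; chars inside First/Last ranges
    // (e.g. CJK ideographs) and unassigned chars have no entry, so bcv is null for them
    udb.TryGetValue(c, out var bcv);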
    if (bcv == "L" || bcv == "R" || bcv == "AL")
    {
        Console.WriteLine(
            "{0} - {1} : {2} [{3}]", 
            c, 
            pos, 
            CharUnicodeInfo.GetUnicodeCategory(c), 
            bcv); 
    }
    pos++;
}

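When run, this prints every character of a strong type together with the position at which it was found. The first line printed is therefore the first strong directionality character of the string; for the test string that is A at position 3, because the digits before it are weak.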
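The parser above stops at 0xffff and has no entries for characters inside the `<..., First>` / `<..., Last>` range pairs. A possible extension, sketched below, keys on code points instead of `char` and stores ranges, so that supplementary characters and range members can be looked up as well. The `FirstStrong` class and its method names are hypothetical. The sketch assumes the lines arrive sorted by code point, as they are in UnicodeData.txt, and it treats code points that are not listed as not strong, which is a simplification.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class FirstStrong
{
    // Builds a list of (first, last, bidiClass) ranges from UnicodeData.txt lines.
    public static List<(int First, int Last, string Bidi)> LoadRanges(IEnumerable<string> lines)
    {
        var ranges = new List<(int First, int Last, string Bidi)>();
        int pendingFirst = -1;
        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            var f = line.Split(';');
            int cp = int.Parse(f[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            if (f[1].EndsWith(", First>", StringComparison.Ordinal))
            {
                pendingFirst = cp; // range opens here and closes on the matching "Last" line
            }
            else if (f[1].EndsWith(", Last>", StringComparison.Ordinal) && pendingFirst >= 0)
            {
                ranges.Add((pendingFirst, cp, f[4]));
                pendingFirst = -1;
            }
            else
            {
                ranges.Add((cp, cp, f[4]));
            }
        }
        return ranges;
    }

    // Returns the UTF-16 index of the first L, R or AL character, or -1 if there is none.
    public static int IndexOfFirstStrong(string s, List<(int First, int Last, string Bidi)> ranges)
    {
        for (int i = 0; i < s.Length; i++)
        {
            int start = i;
            int cp = s[i];
            // combine a surrogate pair into one code point and skip its second half
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                cp = char.ConvertToUtf32(s[i], s[i + 1]);
                i++;
            }
            string bidi = Lookup(ranges, cp);
            if (bidi == "L" || bidi == "R" || bidi == "AL") return start;
        }
        return -1;
    }

    // Binary search over the sorted ranges; an empty string means "not listed".
    private static string Lookup(List<(int First, int Last, string Bidi)> ranges, int cp)
    {
        int lo = 0, hi = ranges.Count - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (cp < ranges[mid].First) hi = mid - 1;
            else if (cp > ranges[mid].Last) lo = mid + 1;
            else return ranges[mid].Bidi;
        }
        return string.Empty;
    }
}
```

Given the lines of the file (for example from `File.ReadLines` on a local copy), `FirstStrong.IndexOfFirstStrong(s, FirstStrong.LoadRanges(lines))` returns the index directly instead of printing every match.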
