How can I find the first strong directionality character of a string in C#?
Assuming I get a string that can have mixed left-to-right and right-to-left content, I want to find the first strong directionality character in it, as defined here.
I think I found a good starting point in this question, but I still can't figure out how the BiDi category is related to the strong directionality characteristic. Is it possible to figure this out in C#?
Instead of relying on the internal implementation, I took a slightly different approach. It leaves room for optimization but gives enough of a basis to answer your question.
I simply download the UnicodeData.txt file that is part of the official release of each Unicode version. For each Unicode character, that file contains its code point and a number of semicolon-delimited fields. A typical line looks like this:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
The fifth field contains the Bidirectional Class Value. A character has strong directionality when that value is L, R, or AL.
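As a minimal sketch of reading that field, the snippet below splits the sample line shown above and picks out the code point and the Bidi class. The helper name `BidiClassOf` is made up for this illustration and is not part of any library.

```csharp
using System;

// Hypothetical helper: returns the fifth field (index 4) of one
// UnicodeData.txt record, which holds the Bidirectional Class Value.
static string BidiClassOf(string record)
{
    var fields = record.Split(';');
    return fields[4];
}

// The sample record quoted above.
var sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;";

// The first field is the code point written in hexadecimal.
var codePoint = Convert.ToInt32(sample.Split(';')[0], 16);

Console.WriteLine($"U+{codePoint:X4} has Bidi class {BidiClassOf(sample)}");
// prints "U+0041 has Bidi class L"
```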
Armed with this knowledge, a naive parser that reads the data and then inspects a demo string looks like this:
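using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Net;
using System.Text;

// hold chars with their Bidi Class Value
var udb = new Dictionary<char, string>();
// download UnicodeData txt file
var cli = new WebClient();
var data = cli.DownloadData("http://www.unicode.org/Public/UNIDATA/UnicodeData.txt");
// parse
using (var ms = new MemoryStream(data))
using (var sr = new StreamReader(ms, Encoding.UTF8))
{
    var line = sr.ReadLine();
    while (line != null)
    {
        var fields = line.Split(';');
        // the first field is the code point in hexadecimal
        int uc = int.Parse(fields[0], NumberStyles.HexNumber);
        // a char holds only 16 bits, so above 0xffff we're lost;
        // the file is sorted by code point, so we can stop here
        if (uc > 0xffff) break;
        var ch = (char) uc;
        // the fifth field (index 4) is the Bidi Class Value
        var bca = fields[4];
        udb.Add(ch, bca);
        line = sr.ReadLine();
    }
}
// test string: digits, a Latin letter, a Hebrew letter and an Arabic letter
var s = "123A\ufb1d\u0620";
Console.WriteLine(s);
var pos = 0;
foreach (var c in s)
{
    // for a char get the Bidi Class Value; ranges such as the CJK ideographs
    // are listed only by their first and last code point, so a lookup can miss
    if (udb.TryGetValue(c, out var bcv) && (bcv == "L" || bcv == "R" || bcv == "AL"))
    {
        Console.WriteLine(
            "{0} - {1} : {2} [{3}]",
            c,
            pos,
            CharUnicodeInfo.GetUnicodeCategory(c),
            bcv);
    }
    pos++;
}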
When run, it prints each character of a strong type together with the position at which it was found.
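Since the question asks only for the first strong character, the loop can stop at the first hit. The following is a rough sketch under a few assumptions: `IndexOfFirstStrong`, `Lookup` and the small `demo` table are names invented for this illustration, and the table is a hand-written stand-in for data parsed from UnicodeData.txt. Unlike the parser above, it walks the string by code point, so surrogate pairs are looked up as one character. It does not skip text inside directional isolates, which the full Unicode Bidirectional Algorithm does when it determines paragraph direction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the UTF-16 index of the first code point whose
// Bidi class is strong (L, R or AL), or -1 when the text contains none.
static int IndexOfFirstStrong(string text, Func<int, string> bidiClassOf)
{
    for (var i = 0; i < text.Length; i++)
    {
        var start = i;
        int codePoint = text[i];
        // Combine a surrogate pair into one code point above 0xFFFF.
        if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
            && char.IsLowSurrogate(text[i + 1]))
        {
            codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
            i++;
        }
        var bidiClass = bidiClassOf(codePoint);
        if (bidiClass == "L" || bidiClass == "R" || bidiClass == "AL")
        {
            return start;
        }
    }
    return -1;
}

// Stand-in table (an assumption for this demo) covering only the characters
// used below; real code would fill it from UnicodeData.txt.
var demo = new Dictionary<int, string>
{
    [0x31] = "EN", [0x32] = "EN", [0x33] = "EN",
    [0x41] = "L", [0xFB1D] = "R", [0x0620] = "AL",
};

// Unknown code points yield an empty class, which is never strong.
string Lookup(int codePoint) => demo.TryGetValue(codePoint, out var cls) ? cls : "";

Console.WriteLine(IndexOfFirstStrong("123A\uFB1D\u0620", Lookup)); // prints 3
Console.WriteLine(IndexOfFirstStrong("12\u0620A", Lookup));        // prints 2
Console.WriteLine(IndexOfFirstStrong("123", Lookup));              // prints -1
```

Passing the lookup as a delegate keeps the search independent of how the Bidi data is stored, so the dictionary built by the parser above could be wrapped in the same way.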