
How does .NET determine the Unicode category of a character?

I was looking in mscorlib.dll with .NET Reflector and stumbled upon the Char class. I always wondered how methods like Char.IsLetter were implemented. I expected a huge list of tests, but by digging a little I found some really short code that determines the Unicode category. However, this code uses some kind of tables and some bit-shifting voodoo. Can anyone explain to me how this is done, or point me to some resources?

EDIT: Here's the code. It's in System.Globalization.CharUnicodeInfo.

internal static unsafe byte InternalGetCategoryValue(int ch, int offset)
{
    ushort num = s_pCategoryLevel1Index[ch >> 8];
    num = s_pCategoryLevel1Index[num + ((ch >> 4) & 15)];
    byte* numPtr = (byte*) (s_pCategoryLevel1Index + num);
    byte num2 = numPtr[ch & 15];
    return s_pCategoriesValue[(num2 * 2) + offset];
}

s_pCategoryLevel1Index is a ushort* and s_pCategoriesValue is a byte*.
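To make the bit shifting concrete, here is a minimal sketch of how a code point is split into the three indices used by InternalGetCategoryValue above. The `IndexSplit` class and `SplitIndices` helper are hypothetical names introduced only for illustration. They mirror the shift-and-mask expressions in the decompiled method and are not framework code.

```csharp
using System;

static class IndexSplit
{
    // Hypothetical helper: splits a code point the same way the
    // decompiled method does (top bits, middle 4 bits, low 4 bits).
    public static (int Level1, int Level2, int Level3) SplitIndices(int ch)
    {
        int level1 = ch >> 8;         // which 256-code-point range
        int level2 = (ch >> 4) & 15;  // which group of 16 inside that range
        int level3 = ch & 15;         // which entry inside that group
        return (level1, level2, level3);
    }

    static void Main()
    {
        var (a, b, c) = SplitIndices('A');  // 'A' is U+0041
        Console.WriteLine($"{a} {b} {c}");   // prints "0 4 1"
    }
}
```

Each index selects an entry at one level of the table, so no lookup ever needs an array with one slot per possible character.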

Both are created in the CharUnicodeInfo static constructor:

static unsafe CharUnicodeInfo()
{
    s_pDataTable = GlobalizationAssembly.GetGlobalizationResourceBytePtr(typeof(CharUnicodeInfo).Assembly, "charinfo.nlp");
    UnicodeDataHeader* headerPtr = (UnicodeDataHeader*) s_pDataTable;
    s_pCategoryLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToCategoriesIndex);
    s_pCategoriesValue = s_pDataTable + ((byte*) headerPtr->OffsetToCategoriesValue);
    s_pNumericLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToNumbericIndex);
    s_pNumericValues = s_pDataTable + ((byte*) headerPtr->OffsetToNumbericValue);
    s_pDigitValues = (DigitValues*) (s_pDataTable + headerPtr->OffsetToDigitValue);
    nativeInitTable(s_pDataTable);
}

Here is the UnicodeDataHeader.

internal struct UnicodeDataHeader
{
    // Fields
    [FieldOffset(40)]
    internal uint OffsetToCategoriesIndex;
    [FieldOffset(0x2c)]
    internal uint OffsetToCategoriesValue;
    [FieldOffset(0x34)]
    internal uint OffsetToDigitValue;
    [FieldOffset(0x30)]
    internal uint OffsetToNumbericIndex;
    [FieldOffset(0x38)]
    internal uint OffsetToNumbericValue;
    [FieldOffset(0)]
    internal char TableName;
    [FieldOffset(0x20)]
    internal ushort version;
}

Note: I hope this doesn't break any license. If so, I'll remove the code.

The basic information is stored in charinfo.nlp, which is embedded in mscorlib.dll as a resource and loaded at runtime. The specifics of the file format are probably known only to Microsoft, but suffice it to say that it is probably a lookup table of some kind.

EDIT

According to MSDN:

This enumeration is based on The Unicode Standard, version 5.0. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.
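For reference, the enumeration in question is what the public API returns. The sketch below uses only the documented CharUnicodeInfo.GetUnicodeCategory and char.IsLetter methods. The `CategoryDemo` class and `Describe` helper are hypothetical names added for illustration.

```csharp
using System;
using System.Globalization;

static class CategoryDemo
{
    // Hypothetical helper: formats the category reported by the public API.
    public static string Describe(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return $"U+{(int)c:X4} -> {cat}, IsLetter = {char.IsLetter(c)}";
    }

    static void Main()
    {
        // A letter, a digit, a space and a symbol fall into different categories.
        foreach (char c in new[] { 'A', 'a', '5', ' ', '+' })
            Console.WriteLine(Describe(c));
    }
}
```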

That looks like a B-tree of sorts.

The advantage is that a bunch of regions can all point to the same "character unknown" block, instead of needing a unique element in the array for each possible Char value.
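The block sharing described above can be sketched with a simplified two-level table. This is an illustration under stated assumptions, not the layout of charinfo.nlp: the framework code shown in the question uses three levels (8/4/4 bits), while this sketch uses two (8/8 bits) and fills its blocks from the public CharUnicodeInfo.GetUnicodeCategory API. The names `TwoStageTable`, `stage1`, `stage2`, `Lookup` and `UniqueBlocks` are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TwoStageTable
{
    const int BlockSize = 256;

    // stage1[ch >> 8] is a block number; stage2[block] holds 256 categories.
    static readonly ushort[] stage1 = new ushort[65536 / BlockSize];
    static readonly List<byte[]> stage2 = new List<byte[]>();

    static TwoStageTable()
    {
        var seen = new Dictionary<string, ushort>();
        for (int hi = 0; hi < stage1.Length; hi++)
        {
            var block = new byte[BlockSize];
            for (int lo = 0; lo < BlockSize; lo++)
                block[lo] = (byte)CharUnicodeInfo.GetUnicodeCategory((char)(hi * BlockSize + lo));

            // Identical blocks are stored once and shared by every slot that needs them.
            string key = Convert.ToBase64String(block);
            if (!seen.TryGetValue(key, out ushort id))
            {
                id = (ushort)stage2.Count;
                stage2.Add(block);
                seen[key] = id;
            }
            stage1[hi] = id;
        }
    }

    public static UnicodeCategory Lookup(char ch) =>
        (UnicodeCategory)stage2[stage1[ch >> 8]][ch & 0xFF];

    public static int UniqueBlocks => stage2.Count;

    static void Main()
    {
        Console.WriteLine(Lookup('A'));  // UppercaseLetter
        Console.WriteLine($"{UniqueBlocks} unique blocks cover {stage1.Length} index slots");
    }
}
```

Large uniform ranges such as the surrogates and the private use area collapse into a single shared block each, which is where the space saving comes from. Adding a third level, as the framework code does, lets smaller repeated runs be shared as well.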
