Comparing Unicode strings in C returns different values than C#

Question

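So I'm trying to write a comparison function in C that takes UTF-8 encoded Unicode strings and uses the Windows CompareStringEx() function, and I expect it to work just like .NET's CultureInfo.CompareInfo.Compare().

The function I've written in C works some of the time, but not in all cases, and I'm trying to figure out why. Here is a case that fails (it passes in C#, but not in C):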

CultureInfo cultureInfo = new CultureInfo("en-US");
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

string stringA = "คนอ้วน ๆ";
string stringB = "はじめまして";
//Result is -1 which is expected
int result = cultureInfo.CompareInfo.Compare(stringA, stringB, compareOptions);

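And here is what I've written in C. Keep in mind that it is supposed to take UTF-8 encoded strings and use the Windows CompareStringEx() function, so a conversion is necessary.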

// Compare flags for the string comparison
#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

int CompareStrings(int lenA, const void *strA, int lenB, const void *strB) 
{
    LCID ENGLISH_LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
    int compareValue = -1;

    // Get the size of the strings as UTF-16 encoded Unicode strings.
    // Note: Passing 0 as the last parameter forces the MultiByteToWideChar function
    // to give us the required buffer size to convert the given string to UTF-16
    int strAWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, NULL, 0);
    int strBWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, NULL, 0);

    // Malloc the strings to store the converted UTF-16 values
    LPWSTR utf16StrA = (LPWSTR) GlobalAlloc(GMEM_FIXED, strAWStrBufferSize * sizeof(WCHAR));
    LPWSTR utf16StrB = (LPWSTR) GlobalAlloc(GMEM_FIXED, strBWStrBufferSize * sizeof(WCHAR));

    // Convert the UTF-8 strings (SQLite will pass them as UTF-8 to us) to standard  
    // windows WCHAR (UTF-16\UCS-2) encoding for Unicode so they can be used in the 
    // Windows CompareStringEx() function.
    if(strAWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, utf16StrA, strAWStrBufferSize);
    }
    if(strBWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, utf16StrB, strBWStrBufferSize);
    }

    // Compare the strings using the Windows compare function.
    // Note: Because explicit byte lengths were passed to MultiByteToWideChar, the returned
    // sizes do not include a null terminator, so they are passed on unchanged.
    if(NULL != utf16StrA && NULL != utf16StrB)
    {
        compareValue = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS, utf16StrA, strAWStrBufferSize, utf16StrB, strBWStrBufferSize, NULL, NULL, 0);
    }

    // In the Windows CompareStringEx() function, 0 indicates an error, 1 indicates less than, 
    // 2 indicates equal to, 3 indicates greater than so subtract 2 to maintain C convention
    if(compareValue > 0)
    {
        compareValue -= 2;
    }

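    // Free the conversion buffers before returning
    if(NULL != utf16StrA) { GlobalFree(utf16StrA); }
    if(NULL != utf16StrB) { GlobalFree(utf16StrB); }

    return compareValue;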
}
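
The return-value mapping at the end of the function can be isolated as a small helper. This is only a sketch: the helper name and the SKETCH_ constants are hypothetical, and the constants merely repeat the 0/1/2/3 values described in the comment above so that the code compiles without <windows.h>. It also separates the error case, because in the function above a CompareStringEx() result of 0 is returned as 0 and so looks like "equal" to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Result values as described for CompareStringEx(); redefined here
   (assumption: illustrative names) so the sketch needs no <windows.h>. */
enum {
    SKETCH_CSTR_ERROR = 0,
    SKETCH_CSTR_LESS_THAN = 1,
    SKETCH_CSTR_EQUAL = 2,
    SKETCH_CSTR_GREATER_THAN = 3
};

/* Hypothetical helper: maps a CompareStringEx() result to -1/0/1.
   *failed is set to 1 when the result signals an error, 0 otherwise. */
static int to_c_compare_result(int cstr_result, int *failed)
{
    assert(failed != NULL);
    *failed = (cstr_result == SKETCH_CSTR_ERROR);
    if (*failed) {
        /* The caller must check *failed; this 0 does not mean "equal". */
        return 0;
    }
    return cstr_result - SKETCH_CSTR_EQUAL;
}
```

A caller would pass the raw CompareStringEx() result and check the failed flag before trusting the returned ordering.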

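Now, if I run the following code, I expect the result to be -1 based on the .NET implementation (see above), but I get 1, indicating that the first string is greater than the second: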

char strA[50] = "คนอ้วน ๆ";
char strB[50] = "はじめまして";

// Will be 1 when we expect it to be -1
int result = CompareStrings(strlen(strA), strA, strlen(strB), strB);

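Any ideas as to why I'm getting a different result? I'm using the same LCID/CultureInfo and compare options in both implementations, and the conversions are successful as far as I can tell.

FYI: this function will be used as a custom collation in SQLite. That isn't relevant to the question, but it explains the function signature in case anyone is wondering.

UPDATE: I also determined that when running the same code in .NET 4, I see the same behavior I saw in the native code. So there is a discrepancy between .NET versions. See my answer below for the reasons behind this.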

Answer 1

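Well, your code performs several steps here, and it's not clear whether it's the comparison step that is failing.

As a first step, I would print out the exact UTF-16 code units you end up with in utf16StrA, utf16StrB, stringA and stringB, in both the .NET code and the C code. I wouldn't be surprised to find a problem with the input data you're using in the C code.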
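
One way to inspect the code units on the C side is sketched below. To keep it compilable outside Windows, it uses a small hand-written UTF-8 decoder instead of MultiByteToWideChar(); both function names are hypothetical, and the decoder is deliberately minimal (it does not reject overlong encodings or encoded surrogates). On Windows, dump_utf16_units() could just as well be pointed at the buffers that MultiByteToWideChar() fills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
   Returns the number of units written, or -1 if the input is malformed
   or the output buffer is too small. */
static int utf8_to_utf16_units(const unsigned char *s, size_t len,
                               uint16_t *out, size_t cap)
{
    size_t i = 0, n = 0;
    assert(s != NULL && out != NULL);
    while (i < len) {
        uint32_t cp;
        size_t extra;
        unsigned char b = s[i++];
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;                      /* invalid lead byte */
        if (extra > len - i) return -1;      /* truncated sequence */
        while (extra-- > 0) {
            unsigned char c = s[i++];
            if ((c & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp >= 0x10000) {
            /* Outside the BMP: emit a surrogate pair. */
            if (cp > 0x10FFFF || n + 2 > cap) return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 > cap) return -1;
            out[n++] = (uint16_t)cp;
        }
    }
    return (int)n;
}

/* Hypothetical helper: prints each code unit as four hex digits. */
static void dump_utf16_units(const char *label, const uint16_t *units, int count)
{
    int i;
    printf("%s:", label);
    for (i = 0; i < count; i++) {
        printf(" %04X", (unsigned)units[i]);
    }
    printf("\n");
}
```

If the bytes in strB really are UTF-8, the dump for "はじめまして" should start with 306F 3058. On the C# side, the values to compare against are the numeric values of the individual char elements of stringA and stringB.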

Answer 2

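What you are hoping for here is that your text editor will save the source code file in UTF-8 format, and that the compiler will then somehow not interpret the source code as UTF-8. That's too much to hope for, at least on my compiler: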

warning C4566: character represented by universal-character-name '\u0E04' cannot be represented in the current code page (1252)

Fix:

const wchar_t* strA = L"คนอ้วน ๆ";
const wchar_t* strB = L"はじめまして";

And remove the conversion code.
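
If the UTF-8 interface has to stay (the question mentions that the function receives UTF-8 from SQLite), another option is to spell the test data as explicit UTF-8 bytes, so that neither the source file encoding nor the compiler's code page matters. The sketch below does this for "はじめまして"; the names STR_B_UTF8 and is_expected_utf8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "はじめまして" as explicit UTF-8 bytes:
   U+306F U+3058 U+3081 U+307E U+3057 U+3066, three bytes each. */
static const char STR_B_UTF8[] =
    "\xE3\x81\xAF" "\xE3\x81\x98" "\xE3\x82\x81"
    "\xE3\x81\xBE" "\xE3\x81\x97" "\xE3\x81\xA6";

/* Hypothetical check: returns 1 if buf holds exactly the bytes above. */
static int is_expected_utf8(const char *buf, size_t len)
{
    assert(buf != NULL);
    return len == sizeof STR_B_UTF8 - 1
        && memcmp(buf, STR_B_UTF8, len) == 0;
}
```

Calling is_expected_utf8(strB, strlen(strB)) on a literal typed directly into the source file shows whether the compiler actually stored UTF-8 bytes for it. STR_B_UTF8 itself can be passed to CompareStrings() with strlen(STR_B_UTF8) as the length.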

Answer 3

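So I ended up figuring out the problem after contacting Microsoft support. Here is what they had to say about the issue:

The reason for the issue you are experiencing, namely running CompareInfo.Compare on the same strings with the same compare options but getting different return values under different versions of the .NET Framework, is that the sorting rules are tied to the Unicode specification, which evolves over time. Historically, .NET has captured the data for side-by-side releases to correspond to the latest version of Windows and the corresponding version of Unicode implemented at that time, so 2.0, 3.0 and 3.5 correspond to the versions in Windows XP or Server 2003, while v4.0 matches the Vista sorting rules. As a result, the sorting rules for the various versions of the .NET Framework have changed over time.

This also means that when I ran the native code I was calling sort methods that follow the Vista sorting rules, and when I ran under .NET 3.5 I was using sort methods that follow the Windows XP sorting rules. It seems odd to me that the Unicode specification would change in a way that causes such a drastic difference, but apparently that is the case. In my opinion, changing the Unicode specification in such a dramatic fashion is a great way to break backwards compatibility.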
