
UTF-8 to UCS-2 conversion with icu library

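I'm currently struggling to convert a UTF-8 string to a UCS-2 string using the icu library. The library offers many ways to do this, but so far none of them seem to work. Given how popular the library is, I assume I'm doing something wrong.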

First, the common code. In all cases I create a string and pass it along on an object, but nothing manipulates it until it reaches the conversion step.

The UTF-8 string currently being used is just "ĩ".

For simplicity, the string is represented as uniString in this code:

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");                    

        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }

        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

The problem is that ucnv_fromAlgorithmic reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.

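The other attempt used ucnv_convert, which you can see commented out. That function attempted the conversion, but it did not fail on the ISO-8859-1 attempt as it should have.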

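So the question is: does anyone with experience of these functions see something incorrect here, or is my assumption about how this character should convert wrong?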

You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open. Quote from the manual:

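"ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; } so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any actions."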
