
utf8 <-> utf16: codecvt poor performance

I'm looking at some of my old (and exclusively Win32-oriented) code and thinking about making it more modern and portable, i.e. reimplementing some widely reusable parts in C++11. One of these parts is converting between UTF-8 and UTF-16. With the Win32 API I use MultiByteToWideChar / WideCharToMultiByte, and I am trying to port that code to C++11 using the sample code from here: https://stackoverflow.com/a/14809553 . The results are:

Release build (compiled by MSVS 2013, run on Core i7 3610QM)

stdlib                   = 1587.2 ms
Win32                    =  127.2 ms

Debug build

stdlib                   = 5733.8 ms
Win32                    =  127.2 ms

The question is: is there something wrong with the code? If everything seems to be OK, is there a good reason for such a performance difference?

Test code is below:

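#include <windows.h>   // QueryPerformanceCounter, MultiByteToWideChar, ...
#include <cstdio>      // printf, fopen, fwrite
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <clocale>
#include <locale>      // std::wstring_convert
#include <codecvt>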

#define XU_BEGIN_TIMER(NAME)                       \
    {                                           \
        LARGE_INTEGER   __freq;                 \
        LARGE_INTEGER   __t0;                   \
        LARGE_INTEGER   __t1;                   \
        double          __tms;                  \
        const char*     __tname = NAME;         \
        char            __tbuf[0xff];           \
                                                \
        QueryPerformanceFrequency(&__freq);     \
        QueryPerformanceCounter(&__t0);         

#define XU_END_TIMER()                             \
        QueryPerformanceCounter(&__t1);         \
        __tms = (__t1.QuadPart - __t0.QuadPart) * 1000.0 / __freq.QuadPart; \
        sprintf_s(__tbuf, sizeof(__tbuf), "    %-24s = %6.1f ms\n", __tname, __tms ); \
        OutputDebugStringA(__tbuf);             \
        printf("%s", __tbuf);                   \
    }   

std::string read_utf8() {
    std::ifstream infile("C:/temp/UTF-8-demo.txt");
    std::string fileData((std::istreambuf_iterator<char>(infile)),
                         std::istreambuf_iterator<char>());
    infile.close();

    return fileData;
}

void testMethod() {
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::string source = read_utf8();
    {
        std::string utf8;

        XU_BEGIN_TIMER("stdlib") {
            for( int i = 0; i < 1000; i++ ) {
                std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert2utf16;
                std::u16string utf16 = convert2utf16.from_bytes(source);

                std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert2utf8;
                utf8 = convert2utf8.to_bytes(utf16);
            }
        } XU_END_TIMER();

        FILE* output = fopen("c:\\temp\\utf8-std.dat", "wb");
        fwrite(utf8.c_str(), 1, utf8.length(), output);
        fclose(output);
    }

    char* utf8 = NULL;
    int cchA = 0;

    {
        XU_BEGIN_TIMER("Win32") {
            for( int i = 0; i < 1000; i++ ) {
                WCHAR* utf16 = new WCHAR[source.length() + 1];
                int cchW;
                utf8 = new char[source.length() + 1];

                cchW = MultiByteToWideChar(
                    CP_UTF8, 0, source.c_str(), source.length(),
                    utf16, source.length() + 1);

                cchA = WideCharToMultiByte(
                    CP_UTF8, 0, utf16, cchW,
                    utf8, source.length() + 1, NULL, NULL);

                delete[] utf16;
                if( i != 999 )
                    delete[] utf8;
            }
        } XU_END_TIMER();

        FILE* output = fopen("c:\\temp\\utf8-win.dat", "wb");
        fwrite(utf8, 1, cchA, output);
        fclose(output);

        delete[] utf8;
    }
}

In my own testing, I found that the constructor call for wstring_convert has a massive overhead, at least on Windows. As other answers suggest, you'll probably struggle to beat the native Windows implementation, but try modifying your code to construct the converter outside of the loop. I expect you'll see an improvement of between 5x and 20x, particularly in a debug build.
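
A minimal sketch of that change, assuming the same converter type as in the question. The helper name round_trip_reusing_converter and its iterations parameter are made up for illustration and are not part of the original test code. Note that `<codecvt>` and std::wstring_convert are deprecated as of C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (not from the original post): round-trips `source`
// through UTF-16 `iterations` times while reusing a single converter.
std::string round_trip_reusing_converter(const std::string& source, int iterations) {
    // Constructed once, outside the loop, so the construction cost is paid
    // only one time instead of twice per iteration.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;

    std::string utf8;
    for (int i = 0; i < iterations; ++i) {
        // from_bytes throws std::range_error on malformed input.
        std::u16string utf16 = converter.from_bytes(source);
        utf8 = converter.to_bytes(utf16);
    }
    return utf8;
}
```

In the question's benchmark this would amount to moving the two converter declarations above the `for` loop inside the "stdlib" timer block. Whether the speedup reaches the range mentioned above would have to be measured.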

Since Vista, Win32's UTF-8 transcoding has used SSE internally to great effect, something very few other UTF transcoders do. I suspect it will be impossible to beat even with the most highly optimized portable code.

However, the number you've given for codecvt is exceptionally slow if it takes over 10x the time, which suggests a naive implementation. While writing my own UTF-8 decoder, I was able to get within 2-3x of Win32's performance. There's a lot of room for improvement here, but you'd need to implement a custom codecvt to get it.
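
For reference, a hand-written portable decoder might look roughly like the sketch below. This is not the decoder mentioned above, and no performance figure is claimed for it. The function name utf8_to_utf16 is hypothetical, and the sketch assumes that throwing on malformed input is acceptable. It uses a simple ASCII fast path and rejects overlong forms, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original post): decodes UTF-8 to UTF-16.
// Throws std::runtime_error on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more code units than UTF-8 bytes

    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* end = p + in.size();

    while (p < end) {
        unsigned char lead = *p;
        if (lead < 0x80) {  // fast path: ASCII maps 1:1
            out.push_back(static_cast<char16_t>(lead));
            ++p;
            continue;
        }

        char32_t cp;   // code point being assembled
        int extra;     // number of continuation bytes expected
        char32_t min;  // smallest code point valid for this length (overlong check)
        if ((lead & 0xE0) == 0xC0)      { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (end - p <= extra) throw std::runtime_error("truncated UTF-8 sequence");
        for (int i = 1; i <= extra; ++i) {
            if ((p[i] & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (p[i] & 0x3F);
        }
        p += extra + 1;

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A byte-at-a-time loop like this is only a starting point. Processing runs of ASCII several bytes at a time is one common direction for further tuning.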

