UTF-8 to URI percent encoding

Question

I'm trying to convert Unicode code points to percent-encoded UTF-8 code units.

The Unicode -> UTF-8 conversion seems to be working correctly: in tests with Hindi and Chinese characters, the output displays correctly in Notepad++ with UTF-8 encoding and can be converted back properly.

I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the Unicode code point U+5123).

What am I doing wrong?

Added code (note that utf8.h belongs to the UTF8-CPP library).

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0,0,0,0,0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }

    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }

    ofs.close();

    return 0;
}

Answer 1

You need to convert each byte value to the ASCII text of its hexadecimal representation. For example:

"é" in UTF-8 is the byte sequence { 0xc3, 0xa9 }. Please note that these are bytes, i.e. char values in C++.

These bytes need to be converted to the strings "%C3" and "%A9" respectively.
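
As a rough illustration of that byte-to-text step, here is a minimal sketch that maps one byte to its three-character escape using a lookup table of hex digits. The names percent_encode_byte and percent_encode_bytes are hypothetical (they are not part of UTF8-CPP or the standard library), and the sketch escapes every byte, which matches the question but is stricter than a URI normally requires.

```cpp
#include <string>

// Hypothetical helper (not from any library): turns one byte into "%XX".
std::string percent_encode_byte(unsigned char byte)
{
    static const char hex_digits[] = "0123456789ABCDEF";
    std::string out = "%";
    out.push_back(hex_digits[byte >> 4]);   // high nibble
    out.push_back(hex_digits[byte & 0x0F]); // low nibble
    return out;
}

// Hypothetical helper: escapes every byte of an already UTF-8 encoded string.
std::string percent_encode_bytes(const std::string &utf8)
{
    std::string out;
    out.reserve(utf8.size() * 3);
    for (unsigned char byte : utf8) {       // char -> unsigned char avoids sign issues
        out += percent_encode_byte(byte);
    }
    return out;
}
```

With these helpers, percent_encode_bytes("\xC3\xA9") yields "%C3%A9", and a byte below 0x10 such as 0x0A still gets two digits ("%0A").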

A straightforward way to do so is to use std::ostringstream from <sstream>:

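#include <iomanip>
#include <sstream>
#include <string>

std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";

// Uppercase hex, zero-padded to two digits (so 0x0A becomes %0A, not %A).
out << std::hex << std::uppercase << std::setfill('0');
for (std::size_t i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::setw(2) << (int)(unsigned char)utf8str[i];
}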

Or, with a C++11 range-based for loop:

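// std::setw and std::setfill come from <iomanip>; they pad each byte to two digits.
for (auto c : utf8str) {
    out << '%' << std::hex << std::uppercase << std::setfill('0')
        << std::setw(2) << (int)(unsigned char)c;
}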

Please note that each byte needs to be cast to int, because otherwise the << operator treats it as a character and writes the raw byte value rather than its hexadecimal digits. The cast to unsigned char must come first: where char is signed, the sign bit would otherwise propagate into the int value, producing output like FFFFFFE5.
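
The effect of the two casts can be seen in isolation with a small sketch. The helper names to_hex, hex_via_signed_char and hex_via_unsigned_char are hypothetical, and the FFFFFFE5 result assumes a platform where int is 32 bits wide and the conversion of 0xE5 to signed char wraps to a negative value (the usual case).

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats an int as uppercase hex via a string stream.
std::string to_hex(int value)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << value;
    return out.str();
}

// Byte 0xE5 seen through a signed char: the sign bit is extended into the int.
std::string hex_via_signed_char()
{
    signed char c = static_cast<signed char>(0xE5);
    return to_hex(static_cast<int>(c));
}

// Same byte seen through an unsigned char: the value stays in the range 0..255.
std::string hex_via_unsigned_char()
{
    unsigned char c = 0xE5;
    return to_hex(static_cast<int>(c));
}
```

Under those assumptions the signed path gives "FFFFFFE5" and the unsigned path gives "E5", which is why the unsigned char cast comes first.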

Answer 1
3 · accepted · 2013-10-06 17:51:34
