
How to convert Unicode escaped characters to utf8?

Original post: 2012-11-30 19:47:43 3 2 c/ encoding/ utf-8

I saw the other questions about the subject, but all of them were missing important details. I want to convert \u00252F\u00252F\u05de\u05e8\u05db\u05d6 to utf8. I understand that you look through the stream for \u followed by four hex digits, which you convert to bytes. The problems are as follows:

I heard that sometimes you look for 4 bytes after and sometimes 6 bytes after; is this correct? If so, how do you determine which it is? E.g. is \u00252F 4 or 6 bytes?
In the case of \u0025, this maps to one byte instead of two (0x25); why? Are the four hex digits supposed to represent utf16, which I am supposed to convert to utf8?
How do I know whether the text is supposed to be the literal characters \u0025 or the Unicode sequence? Does that mean that all backslashes must be escaped in the stream?
Lastly, am I being stupid in doing this by hand when I can use iconv to do this for me?

2 answers

If you have the iconv interfaces at your disposal, you can simply convert the \u0123\uABCD etc. sequences to an array of bytes 01 23 AB CD ..., replacing any unescaped ASCII characters with a 00 byte followed by the ASCII byte, then run the array through iconv with a conversion descriptor obtained by iconv_open("UTF-8", "UTF-16-BE").
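The iconv route described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `hex_value` and `escaped_to_utf8` are hypothetical names, the input is assumed to be pure ASCII apart from the escapes, and the encoding name is spelled "UTF-16BE" because that is the form glibc and GNU libiconv list (accepted names vary between iconv implementations).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: convert ASCII text containing \uXXXX escapes to a
 * NUL-terminated UTF-8 string in out. Returns the UTF-8 length in bytes,
 * or -1 on error (allocation failure, iconv failure, buffer too small). */
static long escaped_to_utf8(const char *in, char *out, size_t out_size)
{
    size_t in_len = strlen(in);
    size_t n = 0;
    unsigned char *utf16;

    if (out_size == 0) return -1;
    /* Every input character produces at most one 2-byte UTF-16 unit. */
    utf16 = malloc(2 * in_len + 1);
    if (utf16 == NULL) return -1;

    /* Step 1: build the big-endian UTF-16 byte array. */
    for (size_t i = 0; i < in_len; ) {
        int h0, h1, h2, h3;
        if (in[i] == '\\' && in[i + 1] == 'u' &&
            (h0 = hex_value(in[i + 2])) >= 0 &&
            (h1 = hex_value(in[i + 3])) >= 0 &&
            (h2 = hex_value(in[i + 4])) >= 0 &&
            (h3 = hex_value(in[i + 5])) >= 0) {
            utf16[n++] = (unsigned char)(h0 << 4 | h1);
            utf16[n++] = (unsigned char)(h2 << 4 | h3);
            i += 6;
        } else {
            /* Unescaped ASCII: a 00 byte followed by the ASCII byte. */
            utf16[n++] = 0x00;
            utf16[n++] = (unsigned char)in[i];
            i += 1;
        }
    }

    /* Step 2: let iconv turn UTF-16BE into UTF-8. */
    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1) {
        free(utf16);
        return -1;
    }
    char *src = (char *)utf16;
    size_t src_left = n;
    char *dst = out;
    size_t dst_left = out_size - 1; /* keep room for the terminating NUL */
    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    free(utf16);
    if (rc == (size_t)-1) return -1;

    *dst = '\0';
    return (long)(dst - out);
}
```

Called on the string from the question, `escaped_to_utf8` should yield the six ASCII bytes `%2F%2F` followed by two UTF-8 bytes for each of the four Hebrew letters. Because the intermediate array is real UTF-16, escaped surrogate pairs are left for iconv to combine.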

Of course you can also do it much more efficiently working directly with the input yourself, but that requires reading and understanding the Unicode specification of UTF-16 and UTF-8.
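For the do-it-yourself route, the two pieces of the specifications that matter most are how a pair of UTF-16 surrogates combines into one code point and how a code point is laid out as UTF-8 bytes. The sketch below shows both; the function names `utf8_encode` and `combine_surrogates` are made up for illustration, and the caller is assumed to have already checked that `hi` is in D800..DBFF and `lo` is in DC00..DFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid scalar value (a lone surrogate, or above U+10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                   /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Combine a UTF-16 high/low surrogate pair into a single code point.
 * Assumes the caller has validated both ranges. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}
```

This also shows why \u0025 ends up as the single byte 0x25: UTF-8 is variable-length, and code points below U+0080 take exactly one byte, while U+05DE (a Hebrew letter) takes two.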

In some conventions (like C++11 string literals), you parse a specific number of hex digits, like four after \u and eight after \U. That may or may not be the convention with the input you provided, but it seems a reasonable guess. In other styles, like C++'s \x, you parse as many hex digits as you can find after the \x, which means that you have to jump through some hoops if you do want to put a literal hex digit immediately after one of these escaped characters.
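The difference between the two conventions can be illustrated with a pair of small parsers. This is only a sketch: `hex_digit`, `parse_fixed_hex` and `parse_greedy_hex` are hypothetical names, both functions are handed the text that follows the escape introducer, and the greedy version does not check for overflow on very long digit runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Fixed-width convention (\u: digits = 4, \U: digits = 8).
 * Reads exactly `digits` hex digits from s. Returns 1 and stores the
 * value on success, 0 if a non-hex character (or the end of the
 * string) is reached first. */
static int parse_fixed_hex(const char *s, size_t digits, uint32_t *value)
{
    uint32_t v = 0;
    for (size_t i = 0; i < digits; i++) {
        int d = hex_digit(s[i]);
        if (d < 0) return 0;
        v = (v << 4) | (uint32_t)d;
    }
    *value = v;
    return 1;
}

/* Greedy convention (like \x): consumes hex digits until the first
 * non-hex character. Returns the number of digits consumed. */
static size_t parse_greedy_hex(const char *s, uint32_t *value)
{
    uint32_t v = 0;
    size_t n = 0;
    int d;
    while ((d = hex_digit(s[n])) >= 0) {
        v = (v << 4) | (uint32_t)d;
        n++;
    }
    *value = v;
    return n;
}
```

Applied to the digits of \u00252F from the question, the fixed-width parser with `digits = 4` reads 0x0025 and leaves `2F` as ordinary text, whereas the greedy parser swallows all six digits and produces 0x252F. That is the ambiguity a fixed digit count avoids.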

Once you have all the values, you need to know what encoding they're in (e.g., UTF-16 or UTF-32) and what encoding you want (e.g., UTF-8). You then use a function to create a new string in the new encoding. You can write such a function (if you know enough about both encoding formats), or you can use a library. Some operating systems may provide such a function, but you might want to use a third-party library for portability.