How to convert ISO-8859-1 to UTF-8 using libiconv in C++
I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini:
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console. Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
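One way to check this without relying on how a console or debugger renders text is to look at the raw bytes. The sketch below is only an illustration: `hex_bytes` is a hypothetical helper, not part of libxml2, and it assumes the buffer is NUL-terminated. Since libxml2 defines its character type as `unsigned char`, a pointer returned by libxml2 can be passed in directly.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format every byte of a NUL-terminated buffer
// as two uppercase hex digits, separated by single spaces.
std::string hex_bytes(const unsigned char* s) {
    std::string out;
    char buf[4];
    for (; *s != 0; ++s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(*s));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

If the buffer holds the letter ס (U+05E1) encoded as UTF-8, the dump should read `D7 A1`. Seeing those bytes while the console still shows garbage suggests the display, not the data, is at fault.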
No. Those entities correspond to the decimal value of the Unicode code point of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
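A minimal sketch of such an algorithm, assuming the code point has already been parsed into an integer (for example 1505 for ס, U+05E1). The function name `encode_utf8` is made up for illustration; the bit layout follows the UTF-8 encoding rules.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Returns an empty string for values that cannot be encoded
// (surrogates U+D800..U+DFFF and anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(1505)` yields the two bytes 0xD7 0xA1, which is the UTF-8 form of ס.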
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue, have a look here.
The C-style way would be (no error checking, for clarity):
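iconv_t ic = iconv_open("UTF-8", "UCS-2"); /* arguments are (tocode, fromcode) */
/* iconv takes the addresses of the char* buffers and of the size_t byte counts, and advances them */
iconv(ic, &myUCS2_Text, &inputSize, &myUTF8_Text, &outputSize);
iconv_close(ic);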
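With error checking added, the same calls might be wrapped roughly as follows. This is a sketch under a few assumptions: `convert_encoding` is a hypothetical helper name, the encoding names passed in are ones the local iconv implementation recognises, and the `iconv` prototype takes `char**` for the input buffer (as glibc and current libiconv do). The final flush needed by stateful encodings is left out because UTF-8 does not require it.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from encoding `from` to encoding `to`.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);          // order is (tocode, fromcode)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string out;
    char buf[256];
    char* in = const_cast<char*>(input.data()); // iconv does not modify the input bytes
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &in, &in_left, &dst, &dst_left);
        int err = errno;                        // save before any other call
        out.append(buf, sizeof buf - dst_left); // keep whatever was converted
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            // EILSEQ: invalid input sequence; EINVAL: truncated input
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
        // E2BIG only means buf was full; loop again with a fresh buffer
    }
    iconv_close(cd);
    return out;
}
```

A call such as `convert_encoding(text, "ISO-8859-1", "UTF-8")` covers the conversion named in the question title, and `"UTF-16LE"` or `"UTF-16BE"` as the source encoding sidesteps the endianness question by stating the byte order explicitly.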