How to convert ISO-8859-1 to UTF-8 using libiconv in C++

Question

I'm using libcurl to fetch some HTML pages.

The HTML pages contain some character references like: סלקום

When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨

Is it the ISO-8859-1 encoding?

If so, how do I convert it to UTF-8 to get the correct word?

Thanks

EDIT: I got the solution. MSalters was right: libxml2 does use UTF-8.

I added this to eclipse.ini:

-Dfile.encoding=utf-8

and finally I got Hebrew characters on my Eclipse console. Thanks

Answer 1

Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.

You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.

To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
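One way to check whether the bytes are correct, independent of how a console or debugger renders them, is to dump them as hex. The following is a minimal sketch; `to_hex` is a hypothetical helper, not part of libxml2. In UTF-8 every Hebrew letter is two bytes with the first byte 0xD7 (ס is D7 A1), and the repeated ׳ in the garbled output above appears consistent with such bytes being displayed in a single-byte Hebrew code page.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not a libxml2 function): returns the bytes of `s`
// as space-separated uppercase hex, e.g. "D7 A1".
std::string to_hex(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        // Format one byte as two hex digits.
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}
```

Since `xmlChar` is a typedef for `unsigned char`, a string returned by libxml2 could be passed as `to_hex(reinterpret_cast<const char*>(text))`, where `text` stands for whatever `xmlChar*` you obtained. If the dump shows pairs starting with D7, the data is already valid UTF-8 and only the display is wrong.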

Answer 2

No. Those entities correspond to the decimal value of the Unicode code point of your characters. See this page for example.

You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
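The integer-to-UTF-8 step can be sketched as follows. This is an illustrative implementation of the standard UTF-8 bit layout, not code from the answer; the function name `encode_utf8` is an assumption. For the reference 1505 (U+05E1, ס) it yields the two bytes D7 A1.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
std::string encode_utf8(std::uint32_t cp) {
    // Reject values outside the Unicode range and UTF-16 surrogates.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {                     // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8` on each decimal value from the references and concatenating the results gives the UTF-8 string. In practice libxml2 already performs this expansion when it parses the document, so a hand-written encoder is only needed if you process the references yourself.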

Answer 3

This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.

I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here.

The C-style way would be (error checking omitted for clarity):

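// iconv_open takes (tocode, fromcode): UCS-2 input, UTF-8 output
iconv_t ic = iconv_open("UTF-8", "UCS-2");
// iconv takes pointers to the buffer pointers and to the byte counts,
// advancing the former and decrementing the latter as it converts
iconv(ic, &myUCS2_Text, &inputSize, &myUTF8_Text, &outputSize);
iconv_close(ic);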
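For the conversion named in the title, ISO-8859-1 to UTF-8, a fuller sketch with error checking might look like the one below. The function name `latin1_to_utf8` is an assumption for illustration. The code assumes the glibc/POSIX declaration of `iconv`, where the input buffer is a `char**`; some implementations declare it as `const char**`, and on platforms where libiconv is a separate library you may need to link with `-liconv`. As the question's edit notes, this conversion turned out to be unnecessary in the original case because libxml2 already returns UTF-8.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts an ISO-8859-1 byte string to UTF-8 via iconv.
std::string latin1_to_utf8(const std::string& input) {
    if (input.empty()) return std::string();

    // Argument order is (tocode, fromcode).
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    // iconv wants a mutable input buffer, so work on a copy.
    std::vector<char> in(input.begin(), input.end());
    // Each ISO-8859-1 byte becomes at most two UTF-8 bytes.
    std::vector<char> out(input.size() * 2);

    char* in_ptr = in.data();
    char* out_ptr = out.data();
    size_t in_left = in.size();
    size_t out_left = out.size();

    // iconv advances the pointers and decrements the remaining byte counts.
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)(-1))
        throw std::runtime_error("iconv failed");

    // Bytes written = capacity minus what is left.
    return std::string(out.data(), out.size() - out_left);
}
```

The output buffer is sized up front because ISO-8859-1 maps every byte to a code point below U+0100, which never needs more than two UTF-8 bytes. For other source encodings you would instead loop, growing the buffer when `iconv` reports `E2BIG`.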


Answer 1: score 3, accepted, 2010-10-20 09:59:31

Answer 2: score 0, 2010-10-20 07:51:35

Answer 3: score 0, 2010-10-20 08:33:30
