C++ Character Encoding

This is my C++ code, where I'm trying to encode the received file path to UTF-8.

#include <cstring>   // memset, memcpy
#include <iostream>
#include <string>

using namespace std;
void latin1_to_utf8(unsigned char *in, unsigned char *out);
string encodeToUTF8(string _strToEncode);

int main(int argc,char* argv[])
{
// Code to receive fileName from Sockets (omitted in the question)
string recvdFName; // placeholder: filled in by the socket code
cout << "recvd ::: " << recvdFName << "\n";
string encStr = encodeToUTF8(recvdFName);
cout << "encoded :::" << encStr << "\n";
}

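// Converts a NUL-terminated ISO-8859-1 string to UTF-8.
// 'out' must have room for twice the input length plus the terminator.
void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
 while (*in)
 {
  if (*in<128)
  {
    *out++=*in++;               // ASCII: copy unchanged
  }
  else
  {
    *out++=0xc2+(*in>0xbf);     // lead byte: 0xC2 for 0x80..0xBF, 0xC3 for 0xC0..0xFF
    *out++=(*in++&0x3f)+0x80;   // continuation byte: low six bits plus 0x80
  }
 }
 *out = '\0';
}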

string encodeToUTF8(string _strToEncode)
{
  int len= _strToEncode.length();
  unsigned char* inpChar = new unsigned char[len+1];
  unsigned char* outChar = new unsigned char[2*(len+1)];
  memset(inpChar,'\0',len+1);
  memset(outChar,'\0',2*(len+1));
  memcpy(inpChar,_strToEncode.c_str(),len);
  latin1_to_utf8(inpChar,outChar);
  string _toRet = (const char*)(outChar);
  delete[] inpChar;
  delete[] outChar;
  return _toRet;
 }

And the output is:

recvd ::: /Users/zeus/ÄÈÊÑ.txt  
encoded ::: /Users/zeus/AÌEÌEÌNÌ.txt

The function latin1_to_utf8 above was given as a solution in "Convert ISO-8859-1 strings to UTF-8 in C/C++", and it appears to work (the answer is accepted). So I think I must be making some mistake, but I'm not able to identify what it is. Can someone please help me out with this?

I first posted this question on Code Review, but I'm not getting any answers there, so sorry for the duplication.

Do you use any platform, or do you build on top of std? I am sure that many people use such conversions, and therefore there are libraries for it. I strongly recommend that you use a library, because a library is tested and usually uses the best known approach.

A library which I found that does this is Boost.Locale.

This is standard. If you use Qt, I recommend using the Qt conversion library for this (it is platform independent).

Qt

In case you want to do it yourself (to see how it works, or for any other reason):

1. Make sure that you allocate memory! This is very important in C and C++. Since you use iostream, use new to allocate memory and delete to release it (this is also important: C++ won't figure out when to release it for you. That is the developer's job here; C++ is hardcore :D).
2. Check that you allocate the right amount of memory. I expect Unicode to need more memory (it encodes more symbols and sometimes uses large numbers).
3. As already mentioned above, read from somewhere (terminal or file) but write the output to a new file. After that, when you open the file with a text editor, make sure you set the encoding to UTF-8 (your text editor has to know how to interpret the data).

I hope that helps.

You are first outputting the original Latin-1 string to a terminal that expects a certain encoding, probably Latin-1. You then transcode to UTF-8 and output the result to the same terminal, which interprets it differently. Classic mojibake. Try the following with the output instead:

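// outChar is an unsigned char*, so it is cast for strlen, which expects const char*
for(size_t i=0, len=strlen(reinterpret_cast<const char*>(outChar)); i!=len; ++i)
    std::cout << static_cast<unsigned>(static_cast<unsigned char>(outChar[i])) << ' ';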

Note that the two casts first get the unsigned byte value and then convert it to an unsigned integer, which keeps the stream from treating it as a char. Your char might already be unsigned, but that is compiler-dependent.
