
Writing std::string with non-ascii data to file

Below is a simplified example of my problem. I have some external byte data which appears to be a string containing the cp1252-encoded degree symbol 0xb0. When it is stored in my program as a std::string, it is correctly represented as 0xffffffb0. However, when that string is then written to a file, the resulting file is only one byte long and contains just 0xb0. How do I write the string to the file? How does the concept of UTF-8 come into this?

#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

typedef struct
{
  char n[40];
} mystruct;

static void dump(const std::string& name)
{
  std::cout << "It is '" << name << "'" << std::endl;
  const char *p = name.data();
  for (size_t i=0; i<name.size(); i++)
  {
    printf("0x%02x ", p[i]);
  }
  std::cout << std::endl;
}

int main()
{
  const unsigned char raw_bytes[] = { 0xb0, 0x00};
  mystruct foo = {};
  // Copy only the bytes that exist; casting raw_bytes to mystruct*
  // would read 40 bytes from a 2-byte array.
  std::memcpy(foo.n, raw_bytes, sizeof(raw_bytes));
  std::string name = std::string(foo.n);
  dump(name);

  std::ofstream my_out("/tmp/out.bin", std::ios::out | std::ios::binary);
  my_out << name;
  my_out.close();

  return 0;
}

Running the above program produces the following on stdout:

It is '�'
0xffffffb0 

First of all, this is a must read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Once you are done with that, you have to understand what type p[i] has.

It is char, which in C and C++ is a small integer type. Whether plain char is signed is implementation-defined, but on most common platforms (x86, for example) it is signed, so a char can be negative!

Your cp1252 characters are outside the ASCII range: their byte values are 0x80 or above. With a signed 8-bit char, such bytes are seen as negative values!

When such a value is passed to printf(), it is promoted to int and the sign bit is replicated into the upper bits, so when you print it you see 0xffffff<actual byte value>.

To handle that in C, first cast to unsigned char:

printf("0x%02x ", (unsigned char)p[i]);

Then the default promotion fills the upper bits with zeros and printf() prints the proper value, 0xb0.
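To make the difference concrete, here is a minimal sketch that formats a single byte both ways. The helper names hexNaive and hexFixed are made up for this illustration, and the code assumes a 32-bit int when describing the sign-extended result.

```cpp
#include <cstdio>
#include <string>

// Mirrors the original dump(): the char is widened to int as-is, so on
// platforms where plain char is signed, bytes >= 0x80 are sign-extended.
std::string hexNaive(char ch)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "0x%02x",
                  static_cast<unsigned int>(static_cast<int>(ch)));
    return buf;
}

// Goes through unsigned char first, so the upper bits are filled with zeros.
std::string hexFixed(char ch)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "0x%02x",
                  static_cast<unsigned int>(static_cast<unsigned char>(ch)));
    return buf;
}
```

Calling hexFixed('\xb0') yields "0xb0" everywhere, while hexNaive('\xb0') yields "0xffffffb0" only where plain char is signed; for ASCII bytes the two agree.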

In C++ this is a bit nastier, since the stream operators treat both char and unsigned char as characters, not numbers. So to print them in hex, you need something like this:

int charToInt(char ch) 
{
    return static_cast<int>(static_cast<unsigned char>(ch));
}

std::cout << std::hex << charToInt(s[i]);

Note that converting directly from char to unsigned int will not fix the problem: the negative value is still sign-extended during the conversion, so with a 32-bit int you get 0xffffffb0 again.
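As a rough sketch of how the stream-based approach could be packaged, the following builds the whole hex dump as a string. The names byteValue and hexDump are hypothetical; byteValue does the same thing as charToInt above.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Widen a char to int without sign extension.
int byteValue(char ch)
{
    return static_cast<int>(static_cast<unsigned char>(ch));
}

// Render every byte of s as "0xNN " using stream formatting only.
std::string hexDump(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (char ch : s)
    {
        // setw applies only to the next insertion, i.e. the integer.
        out << "0x" << std::setw(2) << byteValue(ch) << ' ';
    }
    return out.str();
}
```

With this, std::cout << hexDump(name) could replace the printf() loop in dump().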

See here: https://wandbox.org/permlink/sRmh8hZd78Oar7nF

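UTF-8 has nothing to do with this issue. The file is also written correctly: the string holds exactly one byte, 0xb0, and that is the byte that ends up in the file. The 0xffffffb0 exists only in the debug output.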
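One way to convince yourself that the file content is right is to write the bytes and read them back. This is only a sketch; writeBytes and readBytes are hypothetical helper names, and the path is whatever you choose to pass in.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Write the raw bytes of data to path; returns false if the stream failed.
bool writeBytes(const std::string& path, const std::string& data)
{
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return static_cast<bool>(out);
}

// Read the whole file back as raw bytes (empty string if it cannot be opened).
std::string readBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

After writeBytes(path, name), readBytes(path) should return a string of size 1 whose only byte, viewed as unsigned char, is 0xb0.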

Off-topic: please, when you write pure C++ code, do not use C idioms. It is pointless, it makes the code harder to maintain, and it is not faster. So:

  • do not use char* or char[] to store strings. Just use std::string.
  • do not use printf(), use std::cout (or the fmt library, if you like format strings; it is the basis of std::format in C++20).
  • do not use alloc(), malloc(), free(). In modern C++, use std::make_unique() and std::make_shared().
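Applied to the question's code, that advice might look roughly like the sketch below. The Record type and the fromBytes and makeRecord helpers are invented for this illustration, and it assumes C++14 or later for std::make_unique.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Build a std::string from a raw byte buffer of known length,
// instead of casting the buffer to a struct holding char[40].
std::string fromBytes(const unsigned char* data, std::size_t len)
{
    return std::string(reinterpret_cast<const char*>(data), len);
}

// Hypothetical record type that owns its name as a std::string.
struct Record
{
    std::string name;
};

// Heap allocation without malloc()/free(): the unique_ptr owns the Record.
std::unique_ptr<Record> makeRecord(const unsigned char* data, std::size_t len)
{
    return std::make_unique<Record>(Record{fromBytes(data, len)});
}
```

Because the length is passed explicitly, this also works for data that contains embedded zero bytes, which a char[] read up to the first 0x00 would truncate.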


 