Reading a UTF-16 file in C++

I'm trying to read a file that is encoded as UTF-16LE with a BOM. I tried this code:

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main() {

  std::wifstream fin("/home/asutp/test");
  fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (!fin) {
    std::cout << "!fin" << std::endl;
    return 1;
  }
  if (fin.eof()) {
    std::cout << "fin.eof()" << std::endl;
    return 1;
  }
  std::wstring wstr;
  getline(fin, wstr);
  std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") != std::string::npos) {
    std::cout << "Found" << std::endl;
  } else {
    std::cout << "Not found" << std::endl;
  }

  return 0;
}

The file can contain Latin and Cyrillic text. I created the file with the string "Test тест", and the code prints:

/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled

Not found

Process finished with exit code 0

I'm on Linux Mint 18.3 x64 with CLion 2018.1.

I tried these compilers:

  • gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
  • clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
  • clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)

Ideally you should save files in UTF-8, because Windows has much better UTF-8 support (aside from displaying Unicode in the console window), while POSIX has limited UTF-16 support. Even Microsoft products favor UTF-8 for saving files on Windows.
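To illustrate why UTF-8 is easier to handle here, the following is a minimal sketch that assumes the file has been re-saved as UTF-8. The helper name `first_line_contains` is hypothetical and not part of the question's code. It reads the first line with a plain `std::ifstream`, drops an optional UTF-8 BOM, and searches with a narrow `std::string`.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns true if the first line of a UTF-8 file
// contains `needle` (also given as UTF-8 bytes).
bool first_line_contains(const std::string& path, const std::string& needle) {
    std::ifstream fin(path, std::ios::binary);
    std::string line;
    if (!fin || !std::getline(fin, line)) {
        return false;  // file missing or empty
    }
    // Drop an optional UTF-8 BOM (bytes EF BB BF) at the start of the file.
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        line.erase(0, 3);
    }
    // UTF-8 is byte-oriented, so a plain byte search finds the substring.
    return line.find(needle) != std::string::npos;
}
```

With a UTF-8 file, no locale facet or wide stream is involved, which avoids the `std::codecvt_utf16` setup from the question.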

As an alternative, you can read the UTF-16 file into a buffer and convert that to UTF-8:

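// requires <codecvt>, <fstream>, <locale>, <string>
std::ifstream fin("utf16.txt", std::ios::binary);
fin.seekg(0, std::ios::end);
size_t size = (size_t)fin.tellg();

// skip the 2-byte BOM
fin.seekg(2, std::ios::beg);
size -= 2;

// one char16_t per two bytes; a trailing odd byte is ignored
// (reading straight into char16_t assumes a little-endian host)
std::u16string u16(size / 2, u'\0');
fin.read((char*)&u16[0], u16.size() * sizeof(char16_t));

// convert the UTF-16 code units to UTF-8
// (std::wstring_convert is deprecated since C++17)
std::string utf8 = std::wstring_convert<
    std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);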

Or

std::ifstream fin("utf16.txt", std::ios::binary);

//skip BOM
fin.seekg(2);

//read as raw bytes
std::stringstream ss;
ss << fin.rdbuf();
std::string bytes = ss.str();

//make sure len is divisible by 2
int len = bytes.size();
if(len % 2) len--;

std::wstring sw;
for(size_t i = 0; i < len;)
{
    //little-endian
    int lo = bytes[i++] & 0xFF;
    int hi = bytes[i++] & 0xFF;
    sw.push_back(hi << 8 | lo);
}

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8 = convert.to_bytes(sw);
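Both snippets above assume a little-endian file and rely on `std::wstring_convert`, which is deprecated since C++17. As a hedged alternative, here is a sketch that decodes the raw bytes by hand: it uses the BOM to pick the byte order (falling back to little-endian when there is none, which is an assumption), combines surrogate pairs, and emits UTF-8. The names `append_utf8`, `utf16_bytes_to_utf8` and `read_file_bytes` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode raw UTF-16 bytes into a UTF-8 string.
// A leading BOM selects the byte order; without one, little-endian is assumed.
std::string utf16_bytes_to_utf8(const std::string& bytes) {
    std::size_t i = 0;
    bool little = true;
    if (bytes.size() >= 2) {
        unsigned char b0 = bytes[0], b1 = bytes[1];
        if (b0 == 0xFF && b1 == 0xFE) { little = true;  i = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { little = false; i = 2; }
    }
    // Read the 16-bit code unit that starts at byte offset `pos`.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        std::uint32_t a = static_cast<unsigned char>(bytes[pos]);
        std::uint32_t b = static_cast<unsigned char>(bytes[pos + 1]);
        return little ? ((b << 8) | a) : ((a << 8) | b);
    };
    std::string out;
    while (i + 1 < bytes.size()) {  // a trailing odd byte is ignored
        std::uint32_t cp = unit(i);
        i += 2;
        // High surrogate followed by a low surrogate: combine into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            std::uint32_t lo = unit(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        // Any surrogate left unpaired becomes U+FFFD (replacement character).
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}

// Hypothetical helper: read a whole file as raw bytes.
std::string read_file_bytes(const std::string& path) {
    std::ifstream fin(path, std::ios::binary);
    if (!fin) throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(fin),
                       std::istreambuf_iterator<char>());
}
```

A call such as `utf16_bytes_to_utf8(read_file_bytes("utf16.txt"))` would then give a `std::string` that can be searched with `find("Test")`. Doing the decoding by hand keeps the code independent of the host byte order and of the standard library's `codecvt` facets.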

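When searching a std::wstring, compare against std::wstring::npos rather than std::string::npos, so the constant matches the string type (in practice both are the maximum size_t value, so this change alone may not alter the output):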

...
 //std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") == std::wstring::npos) {
    std::cout << "Not Found" << std::endl;
  } else {
    std::cout << "found" << std::endl;
  } 
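As a sanity check on this suggestion, the sketch below asserts at compile time that the two `npos` constants are equal, which should hold with the default allocators because both are `size_t(-1)`. It also wraps the wide search in a small helper; the name `contains` is hypothetical.

```cpp
#include <string>

// Both constants are size_type(-1); with the default allocators size_type
// is std::size_t for std::string and std::wstring alike.
static_assert(std::string::npos == std::wstring::npos,
              "npos is expected to be the same for string and wstring");

// Hypothetical helper: wide-string substring test using the matching npos.
bool contains(const std::wstring& haystack, const std::wstring& needle) {
    return haystack.find(needle) != std::wstring::npos;
}
```

If the static assertion compiles, switching between the two constants cannot change the result of the comparison, so a "Not found" output would point to the decoded contents of `wstr` instead.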
