
C++17: Check if a file is encoded in UTF-8

I want to check whether a file is (likely) encoded in UTF-8. I don't want to use any external libraries (otherwise I would probably use Boost.Locale), just 'plain' C++17. I need this to be cross-platform, at least on MS Windows and Linux, building with Clang, GCC and MSVC.

I am aware that such a check can only be a heuristic, since you can craft, e.g., an ISO-8859 encoded file containing a weird combination of special characters which yields a valid UTF-8 sequence (corresponding to probably equally weird, but different, Unicode characters).

My best attempt so far is to use std::wstring_convert and std::codecvt<char16_t, char, std::mbstate_t> to attempt a conversion from the input data (assumed to be UTF-8) into something else (UTF-16 in this case) and handle a thrown std::range_error as "the file was not UTF-8". Something like this:

bool check(const std::filesystem::path& path)
{
    // Open in binary mode so no newline translation touches the raw bytes.
    std::ifstream ifs(path, std::ios::binary);

    if (!ifs)
    {
        return false;
    }

    std::string data = std::string(std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>());

    std::wstring_convert<deletable_facet<std::codecvt<char16_t, char, std::mbstate_t>>, char16_t>
        conv16;
    try
    {
        // from_bytes throws std::range_error if the input cannot be converted.
        std::u16string str16 = conv16.from_bytes(data);
        std::cout << "Probably UTF-8\n";
        return true;
    }
    catch (const std::range_error&)
    {
        std::cout << "Not UTF-8!\n";
        return false;
    }
}

(Note that the conversion code, as well as the undefined deletable_facet, is taken more or less verbatim from cppreference.)

Is that a sensible approach? Are there better ways that do not rely on external libraries?

The rules for UTF-8 are much more stringent than for UTF-16, and are quite easy to follow. The code below basically does BNF parsing to check the validity of a string. If you plan to check streams, remember that the longest UTF-8 sequence is 6 bytes long under RFC 2279 (4 bytes under RFC 3629), so if an error appears fewer than 6 bytes before the end of a buffer, you may simply have a truncated symbol.
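As an aside on the streaming remark above, here is a minimal sketch of one way to deal with a symbol cut off by a buffer boundary. The helper name incomplete_tail_length is hypothetical and not part of the answer's code. It only inspects lead bytes, using the RFC 2279 length table (up to 6 bytes), and does not validate anything by itself.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns how many bytes at the end of `chunk` look like
// the beginning of a multi-byte sequence that was cut off by the buffer
// boundary, or 0 if the chunk does not end in an incomplete sequence.
std::size_t incomplete_tail_length(std::string_view chunk) {
    const std::size_t n = chunk.size();
    // An incomplete prefix of a 6-byte sequence is at most 5 bytes long.
    for (std::size_t back = 1; back <= 5 && back <= n; ++back) {
        const auto b = static_cast<unsigned char>(chunk[n - back]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte: keep looking for the lead byte

        std::size_t need = 0;              // total length announced by the lead byte
        if      (b >= 0xC0 && b <= 0xDF) need = 2;
        else if (b >= 0xE0 && b <= 0xEF) need = 3;
        else if (b >= 0xF0 && b <= 0xF7) need = 4;
        else if (b >= 0xF8 && b <= 0xFB) need = 5;
        else if (b >= 0xFC && b <= 0xFD) need = 6;
        else return 0;                     // ASCII or an invalid lead byte: nothing to carry over

        return back < need ? back : 0;     // carry over only if bytes are still missing
    }
    return 0;
}
```

The intended use is to validate chunk.substr(0, chunk.size() - tail) and prepend the last tail bytes to the next buffer read from the stream. A non-zero tail once the stream is exhausted would then be a genuine error rather than a truncation.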

The code below can certainly be optimized. I adapted code I had on hand, but had to change some bits (the original uses a BNF library). In particular, you should write your own match_one(); that would speed things up quite a bit.

NOTE: the code below is backwards-compatible with RFC 2279, the precursor to the current standard (defined in RFC 3629). If any of the text you plan to check could have been generated by software made before 2004, then use this; otherwise, if you need more stringent testing for RFC 3629 compliance, the rules can be modified quite easily.

#include <algorithm>
#include <cctype>    // std::isspace
#include <cstddef>
#include <cstdint>   // uint8_t
#include <iostream>
#include <string>
#include <string_view>

size_t find_first_not_utf8(std::string_view s) {
    // ----------------------------------------------------
    // returns true if fn(c) returns true for all of the first n characters c
    // of string src. the string_view is updated to exclude the first n
    // characters if a match is found, left untouched otherwise.
    auto match_n = [](std::string_view& src, size_t n, auto&& fn) noexcept {
        if (src.length() < n) return false;

        const auto SRC = src;
        const auto E = src.begin() + n;
        if (!std::all_of(src.begin(), E, fn)) {
            src = SRC;
            return false;
        }
        src = src.substr(n);
        return true;
    };

    // ----------------------------------------------------
    // returns true if the first character sequence of src is a valid non-ascii
    // utf8 sequence.
    // the string_view is updated to exclude the first utf-8 sequence if a
    // non-ascii sequence is found, left untouched otherwise.

    auto utf8_non_ascii = [&](std::string_view& src) noexcept {
        const auto SRC = src;

        auto UTF8_CONT = [](uint8_t c) noexcept {
            return 0x80 <= c && c <= 0xBF;
        };

        if (match_n(src, 1, [](uint8_t c) { return 0xC0 <= c && c <= 0xDF; }) &&
            match_n(src, 1, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_n(src, 1, [](uint8_t c) { return 0xE0 <= c && c <= 0xEF; }) &&
            match_n(src, 2, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_n(src, 1, [](uint8_t c) { return 0xF0 <= c && c <= 0xF7; }) &&
            match_n(src, 3, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_n(src, 1, [](uint8_t c) { return 0xF8 <= c && c <= 0xFB; }) &&
            match_n(src, 4, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_n(src, 1, [](uint8_t c) { return 0xFC <= c && c <= 0xFD; }) &&
            match_n(src, 5, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        return false;
    };

    // ----------------------------------------------------
    // returns true if the first symbol of string src is a valid UTF8 character:
    // printable ascii, whitespace, or a non-ascii sequence. other control
    // characters are rejected.
    // the string_view is updated to exclude the first utf-8 sequence
    // if a valid symbol sequence is found, left untouched otherwise.

    auto utf8_char = [&](std::string_view& src) noexcept {
        auto rule = [](uint8_t c) noexcept -> bool {
            return (0x21 <= c && c <= 0x7E) || std::isspace(c);
        };

        const auto SRC = src;

        if (match_n(src, 1, rule)) return true;
        src = SRC;  // restore the view that was passed in, not the outer `s`
        return utf8_non_ascii(src);
    };

    const auto S = s;

    while (!s.empty() && utf8_char(s)) {
    }

    if (s.empty()) return std::string_view::npos;

    return size_t(s.data() - S.data());
}

void test(const std::string& s) {
    std::cout << "testing \"" << s << "\": ";

    auto pos = find_first_not_utf8(s);

    if (pos < s.length())
        std::cout << "failed at offset " << pos << "\n";
    else
        std::cout << "OK\n";
}

auto greek = "Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι\n ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς ";
auto ethiopian = "ሰማይ አይታረስ ንጉሥ አይከሰስ።";

const char* errors[] = {
    "2-byte sequence with last byte missing (U+0000):   \xC0xyz",
    "3-byte sequence with last byte missing (U+0000):   \xe0\x81xyz",
    "4-byte sequence with last byte missing (U+0000):   \xF0\x83\x80xyz",
    "5-byte sequence with last byte missing (U+0000):   \xF8\x81\x82\x83xyz",
    "6-byte sequence with last byte missing (U+0000):   \xFD\x81\x82\x83\x84xyz"
};

int main() {
    test("hello world");
    test(greek);
    test(ethiopian);

    for (auto& e : errors) test(e);
    return 0;
}

You'll be able to play with the code here: https://godbolt.org/z/vG8ffacPT
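For the stricter RFC 3629 rules mentioned in the note above, a minimal sketch could look like the following. It is not the answer's code: the function name find_first_not_strict_utf8 is hypothetical, and it is a plain byte-range check rather than a BNF-style parser. It rejects overlong encodings, UTF-16 surrogates, code points above U+10FFFF and the 5- and 6-byte forms. Unlike the code above, it accepts ASCII control characters, so add that restriction back if you want it.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the offset of the first byte that does not
// start a well-formed RFC 3629 UTF-8 sequence, or std::string_view::npos if
// the whole input is well-formed.
std::size_t find_first_not_strict_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char b0 = p[i];
        if (b0 <= 0x7F) { ++i; continue; }            // single-byte ASCII

        std::size_t len = 0;                          // total sequence length
        unsigned char lo = 0x80, hi = 0xBF;           // allowed range of the 2nd byte
        if      (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }            // C0/C1 would be overlong
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlong 3-byte forms
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates D800..DFFF
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlong 4-byte forms
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // nothing above U+10FFFF
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else return i;                     // 80..BF, C0, C1, F5..FF cannot start a sequence

        if (n - i < len) return i;                    // sequence is truncated
        if (p[i + 1] < lo || p[i + 1] > hi) return i; // second byte out of range
        for (std::size_t k = 2; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return i;  // not a continuation byte
        }
        i += len;
    }
    return std::string_view::npos;
}
```

It follows the same convention as find_first_not_utf8: the result is npos when the input is valid, otherwise the offset at which the offending sequence starts.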

Just use ICU

It exists (that is, it is already installed and in use) on every major modern OS you care about [citation needed]. It's there. Use it.

The good news is, for what you want to do, you don't even have to link with ICU, so no extra magic compilation flags are necessary!

This should compile with anything (modern) you've got:

#include <string>

#include <unicode/utf8.h>

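bool is_utf8( const char * s, size_t n )
{
  if (n == 0) return true; // empty strings are UTF-8 encoded
  UChar32 c = 0;
  int32_t i = 0;
  // Ill-formed sequences decode to the substitute value 0, so an embedded
  // NUL byte also ends the scan and is reported as "not UTF-8".
  do { U8_INTERNAL_NEXT_OR_SUB( s, i, (int32_t)n, c, 0 ); }
  while (c and (i < (int32_t)n));
  return !!c;
}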

bool is_utf8( const std::string & s )
{
  return is_utf8( s.c_str(), s.size() );
}

If you are using MSVC's C++17 or earlier, you'll want to add an #include <ciso646> above that.

Example program:

#include <fstream>
#include <iostream>
#include <sstream>

auto file_to_string( const std::string & filename )
{
  std::ifstream f( filename, std::ios::binary );
  std::ostringstream ss;
  ss << f.rdbuf();
  return ss.str();
}

auto ask( const std::string & prompt )
{
  std::cout << prompt;
  std::string s;
  getline( std::cin, s );
  return s;
}

int main( int, char ** argv )
{
  std::string filename = argv[1] ? argv[1] : ask( "filename? " );
  std::cout << (is_utf8( file_to_string( filename ) )
    ? "UTF-8 encoded\n"
    : "Unknown encoding\n");
}

Tested with (Windows) MSVC, Clang/LLVM, MinGW-w64, TDM and (Linux) GCC, Clang.

  • cl /EHsc /W4 /Ox /std:c++17 isutf8.cpp
  • clang++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp

(My copy of TDM is a little out of date. I also had to tell it where to find the ICU headers.)
