
Is there a way to detect Chinese characters in C++? (using Boost)

In a data processing project, I need to detect and split words in Chinese (Chinese words are not separated by spaces). Is there a way to detect Chinese characters using a native C++ feature or the Boost.Locale library?

Generally speaking, if you want full Unicode support in C++, there is little to no way around ICU. Boost provides some access to its features (through Boost.Locale and Boost.Regex), but only if Boost itself was compiled with ICU support. So instead of making sure that the Boost build on the target platform was compiled that way, you are probably better off using the ICU API directly.

If you are looking for word boundaries, icu::BreakIterator (more specifically, icu::BreakIterator::createWordInstance) is the starting point. You then pass the text to be iterated over via setText and move the iterator via next et al. (Yes, ICU is a bit non-idiomatic this way, as it originated in Java land.)

Alternatively, if you don't want to go for the full C++ API, there's ublock_getCode, which will tell you the UBlockCode of the code point in question.

Here is my attempt using only Boost and the standard library:

#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <functional>
#include <algorithm>

using Iter = boost::u8_to_u32_iterator<std::string::const_iterator>;

template <::boost::uint32_t a, ::boost::uint32_t b>
class UnicodeRange
{
    static_assert(a <= b, "Proper range");
public:
    constexpr bool operator()(::boost::uint32_t x) const noexcept
    {
        return x >= a && x <= b;
    }
};

using UnifiedIdeographs = UnicodeRange<0x4E00, 0x9FFF>;
using UnifiedIdeographsA = UnicodeRange<0x3400, 0x4DBF>;
using UnifiedIdeographsB = UnicodeRange<0x20000, 0x2A6DF>;
using UnifiedIdeographsC = UnicodeRange<0x2A700, 0x2B73F>;
using UnifiedIdeographsD = UnicodeRange<0x2B740, 0x2B81F>;
using UnifiedIdeographsE = UnicodeRange<0x2B820, 0x2CEAF>;
using CompatibilityIdeographs = UnicodeRange<0xF900, 0xFAFF>;
using CompatibilityIdeographsSupplement = UnicodeRange<0x2F800, 0x2FA1F>;

// True if the code point falls into any of the CJK ideograph ranges above.
constexpr bool isChinese(::boost::uint32_t x) noexcept
{
    return UnifiedIdeographs{}(x)
        || UnifiedIdeographsA{}(x) || UnifiedIdeographsB{}(x) || UnifiedIdeographsC{}(x)
        || UnifiedIdeographsD{}(x) || UnifiedIdeographsE{}(x)
        || CompatibilityIdeographs{}(x) || CompatibilityIdeographsSupplement{}(x);
}

int main()
{
    std::string s;
    while (std::getline(std::cin, s))
    {
        // Decode the UTF-8 line on the fly and locate the first run of Chinese characters.
        auto start = std::find_if(Iter{s.cbegin()}, Iter{s.cend()}, isChinese);
        auto stop = std::find_if_not(start, Iter{s.cend()}, isChinese);
        // base() maps the code point iterators back to byte positions in s.
        std::cout << std::string{start.base(), stop.base()} << '\n';
    }

    return 0;
}

https://wandbox.org/permlink/FtxKa8D2LtR3ko9t

You should probably be able to polish this approach into something fully functional. I do not know how to cover this properly with tests, and I am not sure which characters should be included in this check.
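One possible direction for that polishing, sketched with the standard library only: collect every run of Chinese characters in a line instead of just the first one. The names inCjkRange, decodeUtf8 and chineseRuns are invented for this example, the ranges are the same ones used in the code above, and the hand-written decoder is deliberately simplistic (it replaces malformed bytes with U+FFFD and does not reject overlong encodings).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same code point ranges as in the Boost-based attempt.
bool inCjkRange(std::uint32_t cp)
{
    return (cp >= 0x4E00 && cp <= 0x9FFF)      // CJK Unified Ideographs
        || (cp >= 0x3400 && cp <= 0x4DBF)      // Extension A
        || (cp >= 0x20000 && cp <= 0x2A6DF)    // Extension B
        || (cp >= 0x2A700 && cp <= 0x2CEAF)    // Extensions C, D, E (contiguous)
        || (cp >= 0xF900 && cp <= 0xFAFF)      // Compatibility Ideographs
        || (cp >= 0x2F800 && cp <= 0x2FA1F);   // Compatibility Supplement
}

// Decodes the code point starting at s[i] and stores the number of bytes
// consumed in len. Malformed input yields U+FFFD with len = 1.
std::uint32_t decodeUtf8(const std::string& s, std::size_t i, std::size_t& len)
{
    const auto byteAt = [&s](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char b0 = byteAt(i);
    std::uint32_t cp = 0;
    if (b0 < 0x80) { len = 1; return b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else { len = 1; return 0xFFFD; }

    if (i + len > s.size()) { len = 1; return 0xFFFD; }  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = byteAt(i + k);
        if ((b & 0xC0) != 0x80) { len = 1; return 0xFFFD; }  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}

// Returns every maximal run of consecutive Chinese characters in s.
std::vector<std::string> chineseRuns(const std::string& s)
{
    std::vector<std::string> runs;
    std::size_t runStart = std::string::npos;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = 0;
        const bool han = inCjkRange(decodeUtf8(s, i, len));
        if (han && runStart == std::string::npos) {
            runStart = i;  // a new run begins here
        } else if (!han && runStart != std::string::npos) {
            runs.push_back(s.substr(runStart, i - runStart));
            runStart = std::string::npos;
        }
        i += len;
    }
    if (runStart != std::string::npos) {
        runs.push_back(s.substr(runStart));  // run extends to the end of the line
    }
    return runs;
}
```

As for tests, one option is to check the code points on either side of each range boundary (for example U+4E00 and U+9FFF inside, U+4DFF and U+A000 outside) plus a few mixed strings passed to chineseRuns. Which ranges belong in the check remains a judgment call.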


 