Detecting locale from Unicode string in C++

I have a string and I want to check whether its content is in English or Hindi (my local language). I found that the Unicode range for Hindi characters is U+0900 to U+097F.

What is the simplest way to find out whether the string contains any characters in this range?

I can use std::string or Glib::ustring, whichever is more convenient.

Here is how you do it with Glib::ustring:

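#include <glibmm/ustring.h>

using Glib::ustring;

ustring x("सहस");    // Hindi string (source file must be UTF-8 encoded)
bool is_hindi = false;
// ustring iterators yield whole code points (gunichar), not bytes
for (ustring::const_iterator i = x.begin(); i != x.end(); ++i) {
    if (*i >= 0x0900 && *i <= 0x097F) {
        is_hindi = true;
        break;       // one match is enough
    }
}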

The first step is writing a functor to tell if a given wchar_t is Hindi. This will be (derived from) a std::unary_function<wchar_t, bool>. The implementation is trivial: return c >= 0x0900 && c < 0x0980; The second step is using it: std::find_if(begin, end, is_hindi()).
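The two steps above could be sketched roughly as follows. This is only an illustration: the helper name contains_hindi is made up for this example, and the functor is a plain struct rather than a class derived from std::unary_function, because std::unary_function was deprecated in C++11 and removed in C++17.

```cpp
#include <algorithm>
#include <string>

// Step 1: predicate that is true for code points in U+0900..U+097F.
struct is_hindi {
    bool operator()(wchar_t c) const {
        return c >= 0x0900 && c < 0x0980;
    }
};

// Step 2 (hypothetical helper): true if any character of s matches.
bool contains_hindi(const std::wstring& s) {
    return std::find_if(s.begin(), s.end(), is_hindi()) != s.end();
}
```

Calling contains_hindi(L"\u0938\u0939\u0938") should report a match, while a plain ASCII wide string should not.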

Since you'll need Unicode, you should probably use wchar_t and therefore std::wstring. Neither std::string nor Glib::ustring supports Unicode proper. On some systems (Windows in particular) wchar_t is only 16 bits wide, so a single wchar_t can hold only code points up to U+FFFF, but that should still be enough for 99.9% of the world's population.

You'll need to convert from/to UTF-8 on I/O, but the advantage of "one character = one wchar_t" is big. For instance, std::wstring::substr() will work reasonably. You might still have issues with "characters" like U+094B (DEVANAGARI VOWEL SIGN O), though. When iterating over a std::wstring, that will appear to be a character by itself, instead of a modifier. That's still better than std::string with UTF-8, where you'd end up iterating over the individual bytes of U+094B. And to take just your original example, none of the bytes in the UTF-8 encoding of U+094B falls in the Hindi range.
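To make the byte-versus-character difference concrete, here is a small sketch. The names utf8_code_points, vowel_sign_utf8 and vowel_sign_wide are invented for this example, and the counting helper assumes well-formed UTF-8 and a C++11 compiler.

```cpp
#include <cstddef>
#include <string>

// U+094B (DEVANAGARI VOWEL SIGN O) as three UTF-8 bytes and as one wchar_t.
const std::string  vowel_sign_utf8 = "\xE0\xA5\x8B";
const std::wstring vowel_sign_wide = L"\u094B";

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes well-formed input.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Here vowel_sign_utf8.size() is 3 and vowel_sign_wide.size() is 1, and each of the bytes 0xE0, 0xA5 and 0x8B is at most 0xFF, so a byte-by-byte comparison against 0x0900..0x097F can never match.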

If the string is already encoded as UTF-8, I would not convert it to UTF-16 (I assume that's what MSalters calls "Unicode proper") but would instead iterate through the UTF-8 encoded string and check whether there is a Hindi character in it.

With std::string, you can easily iterate with the help of the UTF8-CPP library: take a look at the utf8::next() function or the iterator class.
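If pulling in a library is not an option, the same idea can be sketched by hand. The code below does not use UTF8-CPP; next_code_point and contains_hindi_utf8 are hypothetical helpers written for this example. It assumes mostly well-formed UTF-8, returns U+FFFD for obviously broken sequences, and does not reject overlong encodings or surrogates, so a real library is the safer choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the UTF-8 sequence starting at s[i], advances
// i past it, and returns the code point. Broken input yields U+FFFD.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;   // number of continuation bytes expected
    char32_t cp = 0;
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }  // stray continuation or invalid lead byte

    if (i + extra >= s.size()) {  // sequence is cut off at the end
        i = s.size();
        return 0xFFFD;
    }
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// True if any decoded code point lies in U+0900..U+097F.
bool contains_hindi_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const char32_t cp = next_code_point(s, i);
        if (cp >= 0x0900 && cp <= 0x097F) {
            return true;
        }
    }
    return false;
}
```

The range check happens on decoded code points rather than on raw bytes, which is the whole point of iterating with something like utf8::next() instead of indexing into the std::string directly.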

Glib::ustring has an iterator that seems to enable the same functionality (haven't tried it).
