Detecting locale from a Unicode string in C++

Question

I have a string and I want to check whether its content is in English or Hindi (my local language). I figured out that the Unicode range for Hindi characters is U+0900 to U+097F.

What is the simplest way to find out whether the string has any characters in this range?

I can use std::string or Glib::ustring, whichever is more convenient.

Answer 1

Here is how you do it with Glib::ustring:

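#include <glibmm/ustring.h>

using Glib::ustring;

ustring x("सहस");    // Hindi string; the source file must be saved as UTF-8
bool is_hindi = false;
// ustring iterators yield whole code points (gunichar), not bytes
for (ustring::iterator i = x.begin(); i != x.end(); ++i) {
    if (*i >= 0x0900 && *i <= 0x097F) {
        is_hindi = true;
        break;    // one character in the range is enough
    }
}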

Answer 2

The first step is writing a functor that tells whether a given wchar_t is Hindi. This will be (derived from) a std::unary_function<wchar_t, bool>. The implementation is trivial: return c >= 0x0900 && c < 0x0980; The second step is using it: std::find_if(begin, end, is_hindi()).
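The two steps above could be sketched roughly as follows. The helper name contains_hindi is an assumption made for illustration, and the base class is left out because std::unary_function was deprecated in C++11 and removed in C++17; a plain struct with operator() works with std::find_if just as well.

```cpp
#include <algorithm>
#include <string>

// Step 1: predicate that is true when c lies in U+0900..U+097F.
struct is_hindi {
    bool operator()(wchar_t c) const {
        return c >= 0x0900 && c < 0x0980;
    }
};

// Step 2 (hypothetical helper name): true if any character of s
// satisfies the predicate.
bool contains_hindi(const std::wstring& s) {
    return std::find_if(s.begin(), s.end(), is_hindi()) != s.end();
}
```

Calling contains_hindi(L"\u0938\u0939\u0938") should return true, while a pure ASCII string such as L"hello" should return false.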

Since you'll need Unicode, you should probably use wchar_t and therefore std::wstring. Neither std::string nor Glib::ustring supports Unicode proper. On some systems (Windows in particular) wchar_t is only 16 bits wide, so a single wchar_t can represent only the Basic Multilingual Plane, but that includes the Hindi range and should still be enough for 99.9% of the world's population.

You'll need to convert from/to UTF-8 on I/O, but the advantage of "one character = one wchar_t" is big. For instance, std::wstring::substr() will work reasonably. You might still have issues with "characters" like U+094B (DEVANAGARI VOWEL SIGN O), though. When iterating over a std::wstring, it will appear to be a character by itself instead of a modifier. That's still better than std::string with UTF-8, where you'd end up iterating over the individual bytes of U+094B. And to take just your original example, none of the individual bytes in the UTF-8 encoding of U+094B (E0 A5 8B) falls in the Hindi range.
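A small sketch may make the difference concrete. Both function names below are assumptions made for illustration; the first applies the range check to the elements of a std::wstring, the second applies the same check naively to the bytes of a UTF-8 std::string.

```cpp
#include <cstddef>
#include <string>

// Counts elements of a wide string that fall in U+0900..U+097F.
std::size_t count_wide_in_range(const std::wstring& s) {
    std::size_t n = 0;
    for (std::wstring::size_type i = 0; i < s.size(); ++i)
        if (s[i] >= 0x0900 && s[i] <= 0x097F) ++n;
    return n;
}

// Same check applied byte by byte to a UTF-8 std::string.
std::size_t count_bytes_in_range(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned int b = static_cast<unsigned char>(s[i]);
        if (b >= 0x0900 && b <= 0x097F) ++n;  // never true: a byte is at most 0xFF
    }
    return n;
}
```

For the syllable made of U+0938 followed by U+094B, the wide string L"\u0938\u094B" has two elements and both are counted, with the vowel sign showing up as its own element. The UTF-8 form "\xE0\xA4\xB8\xE0\xA5\x8B" has six bytes and none of them is counted.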

Answer 3

If the string is already encoded as UTF-8, I would not convert it to UTF-16 (I assume that's what MSalters calls "Unicode proper") but iterate through the UTF-8 encoded string and check whether there is a Hindi character in it.

With std::string, you can easily iterate with the help of the UTF8-CPP library: take a look at the utf8::next() function or the iterator class.
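If pulling in a library is not an option, the idea can be approximated by hand. The sketch below is not UTF8-CPP code and the function name is an assumption; it only decodes three-byte sequences, which is sufficient here because every code point in U+0900..U+097F encodes as exactly three bytes in UTF-8. It assumes the input is valid UTF-8 and does no error reporting.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): true if the UTF-8 string s
// contains a code point in U+0900..U+097F. Malformed input is skipped
// byte by byte rather than reported.
bool utf8_contains_hindi(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        // A three-byte sequence starts with a lead byte of the form 1110xxxx.
        if ((b0 & 0xF0) == 0xE0 && i + 2 < s.size()) {
            unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
            unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
            // Both trailing bytes must be continuation bytes (10xxxxxx).
            if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
                unsigned long cp = ((b0 & 0x0FUL) << 12) |
                                   ((b1 & 0x3FUL) << 6) |
                                   (b2 & 0x3FUL);
                if (cp >= 0x0900 && cp <= 0x097F) return true;
                i += 3;
                continue;
            }
        }
        ++i;  // ASCII, other sequence lengths, or malformed bytes: move on
    }
    return false;
}
```

With this sketch, the UTF-8 bytes of "सहस" ("\xE0\xA4\xB8\xE0\xA4\xB9\xE0\xA4\xB8") should give true, while "hello" and the euro sign ("\xE2\x82\xAC", U+20AC) should give false.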

Glib::ustring has an iterator that seems to enable the same functionality (I haven't tried it).

Solution 1: 2, accepted, 2009-08-17 16:50:21
Solution 2: 1, 2009-08-17 13:46:31
Solution 3: 1, 2009-08-17 16:43:38
