Lexicographical sorting for non-ASCII characters

I have done lexicographical sorting for ASCII characters with the following code:

std::ifstream infile; // assumed to be opened on the input file elsewhere
std::string line;
std::vector<std::string> v;
while (std::getline(infile, line)) {
    // If line is empty, ignore it
    if (line.empty())
        continue;
    // Line is non-empty: save it (with its newline restored) in the vector
    v.push_back(line + "\n");
}
std::sort(v.begin(), v.end());

The result should be: a aahr abyutrw bb bhehjr cgh cuttrew....

But I don't know how to do lexicographical sorting for both ASCII and non-ASCII characters in an order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!

The OP didn't, but I find it worth mentioning: when speaking about non-ASCII characters, the encoding should be considered as well.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Characters like À, Á, and Ã are not part of 7-bit ASCII but were covered by a variety of 8-bit encodings, e.g. Windows 1252. Thereby, it is not guaranteed that a given non-ASCII character has the same code point (i.e. number) in every encoding. (Most characters have no number at all in most encodings.)

However, Unicode provides a unique encoding table containing all characters of any other encoding (I believe). There are implementations such as

  • UTF-8, where code points are represented by one or more (up to four) 8-bit values (storage with char)
  • UTF-16, where code points are represented by one or two 16-bit values (storage with std::char16_t or, maybe, wchar_t)
  • UTF-32, where code points are represented by one 32-bit value (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).
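
To make the difference between the three encoding forms concrete, here is a minimal sketch (not part of the original sample) that counts the code units needed for one character. The function names are made up for illustration, and escape sequences are used instead of literal characters so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Code units needed for U+00C4 (Ä) in each Unicode encoding form.
// The UTF-8 bytes are spelled out by hand: 0xC3 0x84.
inline std::size_t utf8Units()  { return std::string("\xC3\x84").size(); }
inline std::size_t utf16Units() { return std::u16string(u"\u00C4").size(); }
inline std::size_t utf32Units() { return std::u32string(U"\u00C4").size(); }

// A character outside the Basic Multilingual Plane (U+1F600) needs a
// surrogate pair in UTF-16 but still only one unit in UTF-32.
inline std::size_t utf16UnitsAstral() { return std::u16string(u"\U0001F600").size(); }
inline std::size_t utf32UnitsAstral() { return std::u32string(U"\U0001F600").size(); }
```

Ä occupies two char values in UTF-8 but a single unit in UTF-16 and UTF-32, which is why comparing strings char by char is not the same as comparing them character by character.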

Concerning the size of wchar_t, see: Character types.

That said, I used wchar_t and std::wstring in my sample to make the handling of umlauts independent of locale and platform.


The order used in std::sort() to sort a range of T elements is defined by default with
bool operator<(const T&, const T&), the < operator for T.
However, there are flavors of std::sort() which take a custom predicate instead.

The custom predicate must match the signature and must provide a strict weak ordering relation.
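
As a small illustration of such a predicate (a sketch that is independent of the sample below; the names lessNoCase and sortedNoCase are hypothetical), this comparison ignores ASCII case. It is a strict weak ordering because it never reports an element as less than itself, whereas a comparison built on <= would break that rule and make std::sort() misbehave.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// "Less than" ignoring ASCII case. Equal-ignoring-case strings compare
// as equivalent: neither is less than the other.
bool lessNoCase(const std::string &a, const std::string &b)
{
  return std::lexicographical_compare(
    a.begin(), a.end(), b.begin(), b.end(),
    [](unsigned char c1, unsigned char c2) {
      return std::tolower(c1) < std::tolower(c2);
    });
}

// Returns a copy of words sorted with the custom predicate.
std::vector<std::string> sortedNoCase(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end(), lessNoCase);
  return words;
}
```

This only handles single-byte ASCII letters; it is meant to show the shape of a predicate, not to solve the non-ASCII case.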

Hence, my recommendation is to use a std::map which maps the characters to an index which results in the intended order.

This is the predicate I used in my sample:

  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };

  auto pred
    = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // compare rank by rank over the common prefix of both words
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };

  std::sort(words.begin(), words.end(), pred);

From bottom to top:

  1. std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
  2. The lambda pred() compares two std::wstrings character by character. The comparison is done using a std::map mapChars which maps wchar_t to unsigned, i.e. a character to its rank in my order.
  3. mapChars stores only a selection of all character values. Hence, the character in question might not be found in mapChars. To handle this, a helper lambda charIndex() is used which returns mapChars.size() in this case, which is guaranteed to be higher than all occurring indices.

The type CharMap is simply a typedef:

typedef std::map<wchar_t, unsigned> CharMap;

To initialize a CharMap, a function is used:

CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

It has to be called with an array of strings which contains all groups of characters in the intended order:

const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
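
To show which ranks come out of such a table, here is a reduced, self-contained sketch. It repeats makeCharMap() so that it compiles on its own; demoTable and rankOf() are hypothetical names introduced only for this illustration, and the umlauts are written as escapes (\u00E4 is ä, \u00C4 is Ä) to stay independent of the source file encoding.

```cpp
#include <cstddef>
#include <map>

typedef std::map<wchar_t, unsigned> CharMap;

// Same construction as makeCharMap() above: every character of the
// i-th string gets rank i.
CharMap makeCharMap(const wchar_t *table[], std::size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

// Hypothetical three-group table: a/A, then ä/Ä, then b/B.
static const wchar_t *demoTable[] = { L"aA", L"\u00E4\u00C4", L"bB" };

// Rank of a character; unknown characters rank after all known ones.
unsigned rankOf(const CharMap &mapChars, wchar_t chr)
{
  const CharMap::const_iterator iter = mapChars.find(chr);
  return iter != mapChars.end() ? iter->second : (unsigned)mapChars.size();
}
```

With this table, a and A share rank 0, ä and Ä share rank 1, b and B share rank 2, and any character not listed gets 6 (the number of map entries), so it sorts behind everything in the table.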

The complete sample:

#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

static const wchar_t *tableGerman[] = {
  L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

typedef std::map<wchar_t, unsigned> CharMap;

// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;

// collect words and sort according to table
void printWordsSorted(
  const std::wstring &text, const wchar_t *table[], const size_t size)
{
  // make look-up table
  const CharMap mapChars = makeCharMap(table, size);
  // strip punctuation and other noise
  std::wstring textClean;
  for (const wchar_t chr : text) {
    if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
      textClean += chr;
    }
  }
  // fill word list with sample text
  std::vector<std::wstring> words;
  for (std::wistringstream in(textClean);;) {
    std::wstring word;
    if (!(in >> word)) break; // bail out
    // store word
    words.push_back(word);
  }
  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };
  auto pred
    = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // compare rank by rank over the common prefix of both words
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };
  std::sort(words.begin(), words.end(), pred);
  // remove duplicates
  std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
  words.erase(last, words.end());
  // print result
  for (const std::wstring &word : words) {
    std::cout << utf8_conv.to_bytes(word) << '\n';
  }
}

template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }

int main()
{
  // a sample string
  std::wstring sampleText
    = L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
      L" have the same lexicographical rank as their counterparts a, o, and u.\n";
  std::cout << "Sample text:\n"
    << utf8_conv.to_bytes(sampleText) << '\n';
  // sort like requested by OP
  std::cout << "Words of text sorted as requested by OP:\n";
  printWordsSorted(sampleText, table, size(table));
  // sort like correct in German
  std::cout << "Words of text sorted as usual in German language:\n";
  printWordsSorted(sampleText, tableGerman, size(tableGerman));
}

Output:

Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut

Live Demo on coliru

Note:

My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.


@Phil1970 reminded me that I forgot to mention something else:

Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale-dependent lexicographical ordering of strings.

The locale plays a role because the order of characters might vary between locales. The std::collate doc. has a nice example of this:

Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
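
A minimal sketch of how this can be used for sorting: a std::locale object is itself usable as a predicate, because its operator() compares two strings through the locale's std::collate facet. The helper names below are hypothetical. Which named locales exist (e.g. "sv_SE.UTF-8") depends on the system, so the sketch falls back to the classic "C" locale when the requested one is not installed.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns a copy of words sorted by the collation rules of loc.
// std::locale::operator() does the comparison via std::collate.
std::vector<std::wstring> sortedByLocale(
  std::vector<std::wstring> words, const std::locale &loc)
{
  std::sort(words.begin(), words.end(), loc);
  return words;
}

// Constructs the named locale, or the classic "C" locale if the
// system does not provide it (std::locale throws std::runtime_error).
std::locale localeOrClassic(const char *name)
{
  try {
    return std::locale(name);
  } catch (const std::runtime_error &) {
    return std::locale::classic();
  }
}
```

Usage would look like sortedByLocale(words, localeOrClassic("sv_SE.UTF-8")). With the classic locale, the order is plain code point order, which is why Zebra sorts before ar in the "default locale" line above.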

Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit arithmetic. For conversion to/from any other encoding (ASCII excluded, as it is a subset of Unicode), I would recommend a library such as libiconv.
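
As an example of that bit arithmetic, here is a sketch of one direction, UTF-32 to UTF-8. The function name encodeUtf8 is made up for this illustration, and the sketch deliberately skips validation: it assumes the input is a valid code point (not a surrogate, not above U+10FFFF).

```cpp
#include <string>

// Encodes one Unicode code point as a UTF-8 byte sequence.
// Assumption: cp is a valid code point; no error checking is done.
std::string encodeUtf8(char32_t cp)
{
  std::string out;
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For instance, ä (U+00E4) becomes the two bytes 0xC3 0xA4. The reverse direction and UTF-16 surrogate pairs follow the same pattern of shifting and masking.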
