简体   繁体   中英

Lexicographical sorting for non-ascii characters

I have done lexicographical sorting for ascii characters by the following code:

std::ifstream infile;
std::string line, new_line;
std::vector<std::string> v;
while(std::getline(infile, line))
            {
                // If line is empty, ignore it
                if(line.empty())
                    continue;
                new_line = line + "\n";
                // Line contains string of length > 0 then save it in vector
                if(new_line.size() > 0)
                    v.push_back(new_line);
            }   
sort(v.begin(), v.end());

The result should be: a aahr abyutrw bb bhehjr cgh cuttrew....

But I don't know how to do Lexicographical sorting for both ascii and non-ascii characters in the order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!

The OP didn't but I find it worth to mention: Speaking about non-ASCII characters, the encoding should be considered as well.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Characters like À, Á, and  are not part of the 7 bit ASCII but were considered in a variety of 8 bit encodings like eg Windows 1252 . Thereby, it's not granted that a certain character (which is not part of ASCII) has the same code point (ie number) in any encoding. (Most of the characters have no number in most encodings.)

However, a unique encoding table is provided by the Unicode containing all characters of any other encoding (I believe). There are implementations as

  • UTF-8 where code points are represented by 1 or more 8 bit values (storage with char )
  • UTF-16 where code points are represented with 1 or 2 16 bit values (storage with std::char16_t or, maybe, wchar_t )
  • UTF-32 where code points are represented with 1 32 bit value (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).

Concerning the size of wchar_t : Character types .

Having that said, I used wchar_t and std::wstring in my sample to make the usage of umlauts locale and platform independent.


The order used in std::sort() to sort a range of T elements is defined by default with
bool < operator(const T&, const T&) the < operator for T .
However, there are flavors of std::sort() to define a custom predicate instead.

The custom predicate must match the signature and must provide a strict weak ordering relation .

Hence, my recommendation to use a std::map which maps the charactes to an index which results in the intended order.

This is the predicate, I used in my sample:

  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };

  auto pred
    = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // + 1 to include zero terminator
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };

  std::sort(words.begin(), words.end(), pred);

From bottom to top:

  1. std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
  2. The lambda pred() , compares two std::wstring s character by character. Thereby, the comparison is done using a std::map mapChars which maps wchar_t to unsigned ie a character to its rank in my order.
  3. The mapChars stores only a selection of all character values. Hence, the character in quest might not be found in the mapChars . To handle this, a helper lambda charIndex() is used which returns mapChars.size() in this case – which is granted to be higher than all occurring indices.

The type CharMap is simply a typedef :

typedef std::map<wchar_t, unsigned> CharMap;

To initialize a CharMap , a function is used:

CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

It has to be called with an array of strings which contains all groups of characters in the intended order:

const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

The complete sample:

#include <string>
#include <sstream>
#include <vector>

static const wchar_t *table[] = {
  L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

static const wchar_t *tableGerman[] = {
  L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
  L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};

typedef std::map<wchar_t, unsigned> CharMap;

// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
  CharMap mapChars;
  unsigned rank = 0;
  for (const wchar_t **chars = table; chars != table + size; ++chars) {
    for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
    ++rank;
  }
  return mapChars;
}

// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;

// collect words and sort accoring to table
void printWordsSorted(
  const std::wstring &text, const wchar_t *table[], const size_t size)
{
  // make look-up table
  const CharMap mapChars = makeCharMap(table, size);
  // strip punctuation and other noise
  std::wstring textClean;
  for (const wchar_t chr : text) {
    if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
      textClean += chr;
    }
  }
  // fill word list with sample text
  std::vector<std::wstring> words;
  for (std::wistringstream in(textClean);;) {
    std::wstring word;
    if (!(in >> word)) break; // bail out
    // store word
    words.push_back(word);
  }
  // sort words
  auto charIndex = [&mapChars](wchar_t chr)
  {
    const CharMap::const_iterator iter = mapChars.find(chr);
    return iter != mapChars.end()
      ? iter->second
      : (CharMap::mapped_type)mapChars.size();
  };
  auto pred
    = [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
  {
    const size_t len = std::min(word1.size(), word2.size());
    // + 1 to include zero terminator
    for (size_t i = 0; i < len; ++i) {
      const wchar_t chr1 = word1[i], chr2 = word2[i];
      const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
      if (i1 != i2) return i1 < i2;
    }
    return word1.size() < word2.size();
  };
  std::sort(words.begin(), words.end(), pred);
  // remove duplicates
  std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
  words.erase(last, words.end());
  // print result
  for (const std::wstring &word : words) {
    std::cout << utf8_conv.to_bytes(word) << '\n';
  }
}

template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }

int main()
{
  // a sample string
  std::wstring sampleText
    = L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
      L" have the same lexicographical rank as their counterparts a, o, and u.\n";
  std::cout << "Sample text:\n"
    << utf8_conv.to_bytes(sampleText) << '\n';
  // sort like requested by OP
  std::cout << "Words of text sorted as requested by OP:\n";
  printWordsSorted(sampleText, table, size(table));
  // sort like correct in German
  std::cout << "Words of text sorted as usual in German language:\n";
  printWordsSorted(sampleText, tableGerman, size(tableGerman));
}

Output:

Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut

Live Demo on coliru

Note:

My original intention was to do the output with std::wcout . This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstring s to UTF-8 . I already knew that UTF-8 is supported in coliru.


@Phil1970 reminded me that I forgot to mention something else:

Sorting of strings (according to “human dictionary” order) is usually provided by std::locale . std::collate provides a locale dependent lexicographical ordering of strings.

The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:

Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp

Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit-arithmetics. For conversion to/from any other encoding (ASCII excluded which is a subset of Unicode), I would recommend a library like eg libiconv .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM