
Sorting UTF-8 strings?

My std::strings are encoded in UTF-8, so the std::string operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it falls short is accents: é sorts after z, which it should not.

Thanks

The standard library has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, and that locale is installed, this program will sort its whitespace-separated input as desired.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
// std::binary_function was removed in C++17, and std::sort never needed it.
class collate_in {
  protected:
    // Hold a copy of the locale so the facet reference below stays valid.
    std::locale loc;
    const std::collate<char> &coll;
  public:
    explicit collate_in(const std::locale &l)
        : loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}
    bool operator()(const std::string &a, const std::string &b) const {
        // std::collate::compare() takes C-style string (begin, end)s and
        // returns values like strcmp or strcoll.  Compare to 0 for results
        // expected for a less<>-style comparator.
        return coll.compare(a.c_str(), a.c_str() + a.size(),
                            b.c_str(), b.c_str() + b.size()) < 0;
    }
};
int main() {
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // std::locale("") is the locale from the environment.  One could also
    // std::locale::global(std::locale("")) to set up this program's global
    // first, and then use locale() to get the global locale, or choose a
    // specific locale instead of using the environment's.
    std::sort(v.begin(), v.end(), collate_in(std::locale("")));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f

It's been brought to my attention that std::locale::operator()(a, b) exists, obviating the std::collate<>::compare(a, b) < 0 wrapper I wrote above.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    std::sort(v.begin(), v.end(), std::locale(""));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
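
One caveat when choosing a specific locale instead of the environment's: the std::locale constructor throws std::runtime_error if the named locale is not available on the machine. A minimal sketch of a fallback follows. The helper names locale_or_classic and sort_with_locale are made up for illustration, and falling back to the classic "C" locale is just one possible policy (it gives up the accent-aware ordering).

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: try the named locale and fall back to the classic
// "C" locale when it is not installed.
std::locale locale_or_classic(const char *name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error &) {
        return std::locale::classic();
    }
}

// Hypothetical helper: sort in place, using the locale itself as the
// comparator via std::locale::operator().
void sort_with_locale(std::vector<std::string> &v, const std::locale &loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

Usage would look like sort_with_locale(v, locale_or_classic("en_US.utf8")); whether é then lands between e and f depends on that locale actually being installed.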

If you don't want plain code-point ordering (which is what comparing the UTF-8 encoded bytes lexicographically will give you), then you will need to decode your UTF-8 encoded strings into code points (UCS-4/UTF-32, or UCS-2 if you only need the Basic Multilingual Plane), and apply a suitable comparison function of your choosing.

To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point.
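
That property can be checked with a small sketch. The decoder below is a deliberately minimal one written for this illustration: it assumes well-formed UTF-8 and does no validation, and the names decode_utf8 and same_order are not from any library.

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 decoder for illustration only: assumes well-formed input
// and does not reject overlong or otherwise invalid sequences.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // The lead byte determines the sequence length (1 to 4 bytes).
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (i + len > s.size()) break;  // truncated tail: stop decoding
        char32_t cp = lead;
        if (len > 1) cp = lead & (0xFFu >> (len + 1));  // payload bits of the lead byte
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// True when byte-wise order and code-point order agree for this pair.
bool same_order(const std::string &a, const std::string &b) {
    return (a < b) == (decode_utf8(a) < decode_utf8(b));
}
```

For example, "z" compares less than "\xC3\xA9" (é) both byte-wise and by code point, which is exactly the ordering the question complains about.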

Update: Your updated question indicates that you want a more complex comparison function than purely a lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.
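
As a rough sketch of what "decode and compare the decoded characters" could look like: the fold table below is hypothetical and covers only a handful of Latin-1 lowercase letters, so it is a toy stand-in for real collation rules (a locale or ICU remains the proper tool). The decoder again assumes well-formed UTF-8, and all names here are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimal decoder; assumes well-formed UTF-8 and does no validation.
std::u32string decode_utf8(const std::string &s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (i + len > s.size()) break;  // truncated tail: stop decoding
        char32_t cp = lead;
        if (len > 1) cp = lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical folding table: maps a few accented Latin-1 letters to their
// base letter. Anything not listed is left unchanged.
char32_t fold(char32_t cp) {
    if (cp >= 0xE0 && cp <= 0xE5) return U'a';  // à á â ã ä å
    if (cp == 0xE7) return U'c';                // ç
    if (cp >= 0xE8 && cp <= 0xEB) return U'e';  // è é ê ë
    if (cp >= 0xEC && cp <= 0xEF) return U'i';  // ì í î ï
    if (cp >= 0xF2 && cp <= 0xF6) return U'o';  // ò ó ô õ ö
    if (cp >= 0xF9 && cp <= 0xFC) return U'u';  // ù ú û ü
    return cp;
}

// Compare by folded code points first; break ties on the raw code points
// so the comparator is still a strict weak ordering.
bool folded_less(const std::string &a, const std::string &b) {
    std::u32string da = decode_utf8(a), db = decode_utf8(b);
    std::u32string fa = da, fb = db;
    std::transform(fa.begin(), fa.end(), fa.begin(), fold);
    std::transform(fb.begin(), fb.end(), fb.begin(), fold);
    if (fa != fb) return fa < fb;
    return da < db;
}
```

It can be passed straight to std::sort, e.g. std::sort(v.begin(), v.end(), folded_less), which places é next to e instead of after z.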

The encoding (UTF-8, UTF-16, etc.) isn't the problem. What matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.

I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which could help you.

One option would be to use ICU collators (http://userguide.icu-project.org/collation/api), which provide a properly internationalized "compare" method that you can then use to sort.

Chromium has a small wrapper that should be easy to copy and reuse:

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs
