
How can Perl and the Unix sort utility order Unicode strings in the same sequence?

I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

Each one of them failed with the following errors (from the Perl side):

  • Input is not sorted: [----,] came after [($1]
  • Input is not sorted: [...] came after [&]
  • Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort and using 8-bit characters in Perl. However, this way Unicode strings are not ordered properly.
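The byte-order fallback described above can be checked from the shell. This is a minimal sketch, assuming a POSIX shell with sort(1) and perl on the PATH; the function names byte_sort and perl_byte_sort are made up for illustration, and the three sample lines are borrowed from the error messages.

```shell
# Sort stdin by raw byte values with sort(1), ignoring the user's locale.
byte_sort() {
    LC_ALL=C sort
}

# Sort stdin with Perl's default string comparison (no "use locale"),
# which orders the undecoded input lines byte by byte as well.
perl_byte_sort() {
    perl -e 'print sort <>'
}

# Illustrative input only; any file could be piped in instead.
printf '%s\n' '1' '($1' '(' | byte_sort
```

Both functions should agree on any input, because neither consults the locale's collation tables. That is also why the result is not a linguistically meaningful order for non-ASCII text.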

Using Unicode::Sort or Unicode::Sort::Locale makes no sense. You're not trying to sort based on Unicode definitions, you're trying to sort based on your locale. That's what "use locale;" is for.

I don't know why you didn't get the desired order out of cmp under "use locale;".

You could process the decompressed files.

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.
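To make the expand-and-recount idea concrete, here is a small sketch wrapped in functions. It swaps the perl one-liner for an equivalent awk program purely to keep the example in one tool; the function names expand_counts and merge_uniqc are hypothetical, and the input is assumed to be in "uniq -c" format (optional leading blanks, a count, one space, then the line).

```shell
# Strip the leading count from each "uniq -c" line and print the
# remainder of the line that many times.
expand_counts() {
    awk '{ n = $1; sub(/^[ \t]*[0-9]+ /, ""); for (i = 0; i < n; i++) print }' "$@"
}

# Expand every input file, then let sort(1) and uniq(1) rebuild the
# counts, so the final order is whatever sort(1) produces in this locale.
merge_uniqc() {
    expand_counts "$@" | sort | uniq -c
}

# Hypothetical usage with the file names from the loop above:
# merge_uniqc file1.uniqc file2.uniqc
```

Because sort(1) does all of the ordering, no second collation engine has to agree with it.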


I found a case where "use locale;" didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.


In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.
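If the goal is only to verify that input is already sorted (the check that produced the "Input is not sorted" errors), one option is to hand that check to the sort utility itself, so a single collation engine is involved. A minimal sketch, assuming a POSIX sort(1) with the -c option; the function name is_sorted and the file name data are placeholders.

```shell
# Succeed if the file is already in sort(1)'s order for the current
# locale. With -c, sort only checks the order and writes nothing to
# stdout; its diagnostic on stderr is discarded here.
is_sorted() {
    sort -c "$1" 2>/dev/null
}

# Hypothetical usage:
# if is_sorted data; then echo "sorted"; else echo "not sorted"; fi
```

This only sidesteps the disagreement between engines. It does not make the order itself any more stable when weights are undefined.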
