How to sort an array of UTF-8 strings?

I currently have no idea how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting in a database (which would be no problem) is not an option. The following does not work on my Windows development machine, although I would expect it to be at least a possible solution:

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

The output is:

string(20) "German_Germany.65001"
string(1) "C"
array(6) {
  [0]=>
  string(6) "Birnen"
  [1]=>
  string(9) "Ungetiere"
  [2]=>
  string(6) "Äpfel"
  [3]=>
  string(5) "Apfel"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(11) "Österreich"
}

This is complete nonsense. Using 1252 as the code page for setlocale() gives different output, but it is still plainly wrong:

string(19) "German_Germany.1252"
string(1) "C"
array(6) {
  [0]=>
  string(11) "Österreich"
  [1]=>
  string(6) "Äpfel"
  [2]=>
  string(5) "Apfel"
  [3]=>
  string(6) "Birnen"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(9) "Ungetiere"
}

Is there a locale-aware way to sort an array of UTF-8 strings?

I just noticed that this seems to be a PHP-on-Windows problem, since the same snippet with de_DE.utf8 as the locale works on a Linux machine. Nevertheless, a solution for this Windows-specific problem would be nice...

$a = array( 'Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев' );
$col = new \Collator('bg_BG');
$col->asort( $a );
var_dump( $a );

Prints:

array
  2 => string 'делян1' (length=11)
  1 => string 'Делян1' (length=11)
  3 => string 'Делян2' (length=11)
  4 => string 'делян3' (length=11)
  5 => string 'кръстев' (length=14)
  0 => string 'Кръстев' (length=14)

The Collator class is defined in the PECL intl extension. It is distributed with the PHP 5.3 sources but might be disabled in some builds. E.g. in Debian it is in the package php5-intl.

Collator::compare is useful for usort.

Update on this issue:
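As a minimal sketch of that remark, the question's German word list can be passed to usort() with Collator::compare as the callback. This assumes the intl extension is loaded; the 'de_DE' locale is my choice for illustration and does not come from the answer above.

```php
<?php
// Assumption: the intl extension is available, so the Collator class exists.
$words = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

// 'de_DE' is an illustrative locale; pick whatever matches your data.
$collator = new Collator('de_DE');

// Collator::compare() returns -1, 0 or 1, which is what usort() expects
// from its comparison callback.
usort($words, array($collator, 'compare'));

print_r($words);
```

With ICU's German rules this should put "Apfel" directly before "Äpfel" and "Österreich" between "Birnen" and "Ungetiere", independent of the operating system's own locale support.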

Even though the discussion around this problem suggested that we might have discovered a PHP bug in strcoll() and/or setlocale(), this is clearly not the case. The problem is rather a limitation of the Windows CRT implementation of setlocale() (PHP's setlocale() is just a thin wrapper around the CRT call). The following is a citation from the MSDN page "setlocale, _wsetlocale":

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.

It is therefore impossible to use locale-aware string operations within PHP on Windows when the strings are multi-byte encoded.

In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an obvious PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the problem lies in the strcoll() function when the Windows UTF-8 code page 65001 is used.

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

// Match only a leading "WIN": stristr(PHP_OS, 'win') would also match "Darwin" (macOS).
$locale=(strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
  [0]=>
  string(1) "c"
  [1]=>
  string(1) "B"
  [2]=>
  string(1) "s"
  [3]=>
  string(1) "C"
  [4]=>
  string(1) "k"
  [5]=>
  string(1) "D"
  [6]=>
  string(2) "ä"
  [7]=>
  string(1) "E"
  [8]=>
  string(1) "g"
  [...]

The same snippet works on a Linux machine without any problems, producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "A"
  [2]=>
  string(2) "ä"
  [3]=>
  string(2) "Ä"
  [4]=>
  string(1) "b"
  [5]=>
  string(1) "B"
  [6]=>
  string(1) "c"
  [7]=>
  string(1) "C"
  [...]

The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course, the mb_* encodings and the locale must be changed accordingly).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).

Perhaps converting your UTF-8 data into Unicode (I'm not familiar with PHP's Unicode functions, sorry), normalizing it into NFD or NFKD, and then sorting on code points would give a collation that makes sense to you (i.e. "A" before "Ä").

Check the links I provided.

EDIT: since you mention that your input data are clear (I assume they all fall within the "windows-1252" code page), you should do the following conversion: UTF-8 → Unicode → Windows-1252, and then sort the Windows-1252 encoded data with the "CP1252" locale selected.
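A rough sketch of the normalize-then-sort idea in PHP, assuming the intl extension provides the Normalizer class (the answer above does not name any PHP function, so this pairing is my assumption). In NFD, "Ä" becomes "A" followed by the combining diaeresis U+0308, so a plain byte-wise comparison of the normalized UTF-8 strings places it right after the unaccented "A" spelling.

```php
<?php
// Assumption: the intl extension is loaded (Normalizer class).
$words = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

usort($words, function ($a, $b) {
    // Compare the NFD forms byte by byte; for valid UTF-8 this is the
    // same as comparing code point by code point.
    return strcmp(
        Normalizer::normalize($a, Normalizer::FORM_D),
        Normalizer::normalize($b, Normalizer::FORM_D)
    );
});

print_r($words);
```

This is not real collation: all uppercase letters still sort before all lowercase ones, and language-specific rules are ignored. It only removes the worst effect, namely accented letters landing after "Z".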
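The conversion route could look roughly like the sketch below. The function name sortViaCp1252 and the list of locale names are hypothetical, and the code assumes that mbstring is loaded and that every character in the input exists in Windows-1252.

```php
<?php
// Hypothetical helper: recode to Windows-1252, sort with strcoll(), recode back.
function sortViaCp1252(array $utf8Strings, array $localeCandidates)
{
    $toCp1252 = function ($s) { return mb_convert_encoding($s, 'Windows-1252', 'UTF-8'); };
    $toUtf8   = function ($s) { return mb_convert_encoding($s, 'UTF-8', 'Windows-1252'); };

    $previous = setlocale(LC_COLLATE, '0');        // query the current collation locale
    // setlocale() tries each name in turn and returns false if none is
    // available; strcoll() then keeps using the previous locale.
    setlocale(LC_COLLATE, $localeCandidates);

    $recoded = array_map($toCp1252, $utf8Strings);
    usort($recoded, 'strcoll');

    setlocale(LC_COLLATE, $previous);              // restore the original locale
    return array_map($toUtf8, $recoded);
}

$sorted = sortViaCp1252(
    array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'),
    // Illustrative names only; which ones exist depends on the platform.
    array('German_Germany.1252', 'de_DE.ISO8859-1', 'de_DE.iso88591')
);
print_r($sorted);
```

The resulting order depends on which of the candidate locales is actually installed, so no expected output is given here.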

Using your example with code page 1252 worked perfectly fine here on my Windows development machine.

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

...snip...

This was with PHP 5.2.6, by the way.


The above example is wrong, it uses ASCII encoding instead of UTF-8. I traced the strcoll() calls, and look what I found:

function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);

gives:

Ungetüme Äpfel 2147483647
Ungetüme Birnen 2147483647
Ungetüme Apfel 2147483647
Ungetüme Ungetiere 2147483647
Österreich Ungetüme 2147483647
Äpfel Ungetiere 2147483647
Äpfel Birnen 2147483647
Apfel Äpfel 2147483647
Ungetiere Birnen 2147483647

I did find some bug reports which have been flagged as bogus... I suppose your best bet is to file a bug report, though...

I found the following helper function, which converts all letters of a string to ASCII letters, very helpful here.

function _all_letters_to_ASCII($string) {
  return strtr(utf8_decode($string), 
    utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
    'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}

After that, a simple array_multisort() gives you what you want.

$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$reference_array = $array;

foreach ($reference_array as $key => &$value) {
  $value = _all_letters_to_ASCII($value);
}
var_dump($reference_array);

array_multisort($reference_array, $array);
var_dump($array);

Of course, you can adapt the helper function to more advanced needs. But for now, it looks pretty good.

array(6) {
  [0]=> string(6) "Birnen"
  [1]=> string(5) "Apfel"
  [2]=> string(8) "Ungetume"
  [3]=> string(5) "Apfel"
  [4]=> string(9) "Ungetiere"
  [5]=> string(10) "Osterreich"
}

array(6) {
  [0]=> string(5) "Apfel"
  [1]=> string(6) "Äpfel"
  [2]=> string(6) "Birnen"
  [3]=> string(11) "Österreich"
  [4]=> string(9) "Ungetiere"
  [5]=> string(9) "Ungetüme"
}
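The helper above relies on utf8_decode(), which is deprecated as of PHP 8.2 and can only represent characters that exist in ISO-8859-1. One possible variant, sketched here under the assumption that a small explicit table is enough, uses the array form of strtr() directly on the UTF-8 strings. The name foldToAscii and the contents of the table are hypothetical, and mapping "ß" to "ss" is my choice rather than the original helper's.

```php
<?php
// Hypothetical helper: fold a few German letters to ASCII for use as a sort key.
function foldToAscii($string)
{
    // The array form of strtr() replaces whole (multi-byte) substrings,
    // so no detour through a single-byte encoding is needed.
    static $map = array(
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'ß' => 'ss',
    );
    return strtr($string, $map);
}

$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$keys  = array_map('foldToAscii', $array);

// Sort by the folded keys; equal keys ("Apfel"/"Äpfel") are ordered by
// the original strings.
array_multisort($keys, $array);
print_r($array);
```

For the sample data this should reproduce the order shown in the second dump above.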

I am confronted with the same problem with German umlauts ("Umlaute"). After some research, this worked for me:

$laender =array("Österreich", "Schweiz", "England", "France", "Ägypten");  
$laender = array_map("utf8_decode", $laender);  
setlocale(LC_ALL,"de_DE@euro", "de_DE", "deu_deu");  
sort($laender, SORT_LOCALE_STRING);  
$laender = array_map("utf8_encode", $laender);  
print_r($laender);

The result:

Array
(
[0] => Ägypten
[1] => England
[2] => France
[3] => Österreich
[4] => Schweiz
)

Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It may be named differently on different platforms, but a good guess would be de_DE.utf8.

On UNIX systems, you can get a list of the currently installed locales with the command

locale -a
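Because the name varies by platform, one hedged way to handle this in PHP is to pass several candidate names to setlocale() and check its return value. The candidates below are guesses for illustration; the real names are whatever locale -a reports on your system.

```php
<?php
// Candidate names are assumptions; replace them with entries from `locale -a`.
$candidates = array('de_DE.utf8', 'de_DE.UTF-8', 'de_DE');

// setlocale() accepts an array, tries each name in order and returns the
// first one that could be set, or false if none is available.
$applied = setlocale(LC_COLLATE, $candidates);

if ($applied === false) {
    echo "None of the candidate locales is installed\n";
} else {
    echo "Using collation locale: $applied\n";
}
```

Checking for false matters because a failed setlocale() leaves the previous locale in place, and strcoll() will then sort with that one without any warning.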
