
PHP mysql turkish character encoding and comparison

I am trying to filter Turkish names from a MySQL database through an AJAX POST. Words with English letters are listed correctly, but if I send Ö (the letter O with two dots), the results come back for both O and Ö, not only Ö.

I also noticed that the AJAX POST sends Ö as %C3%96. Can anybody help?

Please bear with my somewhat lengthy response.
Let's start with your second question. %C3%96 means that the bytes 0xC3 and 0x96 are transmitted. Those two bytes encode the character Ö in utf-8.
From this (and the fact that your query yields the described results) I assume that you're using utf-8 all the way through.
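To see those bytes on the PHP side, here is a minimal sketch that decodes the percent-encoded value by hand. It assumes the mbstring extension is available, and the variable names are only illustrative. PHP normally performs this decoding for $_POST values on its own.

```php
<?php
// Percent-decode the value exactly as it appears in the AJAX request body.
$encoded = '%C3%96';
$decoded = rawurldecode($encoded);

// The result is two bytes, 0xC3 0x96 ...
$hex = bin2hex($decoded);                            // "c396"
$byteCount = strlen($decoded);                       // 2

// ... which together form one valid UTF-8 character, Ö (U+00D6).
$charCount = mb_strlen($decoded, 'UTF-8');           // 1
$isValidUtf8 = mb_check_encoding($decoded, 'UTF-8'); // true

echo $hex, ' ', $byteCount, ' ', $charCount, ' ', var_export($isValidUtf8, true), "\n";
```

This prints `c396 2 1 true`, which matches the two-byte utf-8 encoding described above.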

The lexicographical order of the characters of a given charset is determined by the collation used.
That's more or less an ordered list of characters, e.g. A,B,C,D,.... meaning A<B<C ....
But these lists may contain multiple characters in the same "location", e.g.
[A,Ä],B,C,D.... meaning that A==Ä -> true

___ excursion, not immediately relevant to your question ____
Let's take a look at the "name" of the character Ö: it's LATIN CAPITAL LETTER O WITH DIAERESIS.
So, the base character is O, it just has some decoration(s).
Some systems/libraries allow you to specify the "granularity"/level/strength of the comparison, see e.g. Collator::setStrength of the php-intl extension.

<?php
// utf8 characters
define('SMALL_O_WITH_DIAERESIS', chr(0xC3) . chr(0xB6));
define('CAP_O_WITH_DIAERESIS', chr(0xC3) . chr(0x96));

$coll = collator_create( 'utf-8' );
foreach( array('PRIMARY', 'SECONDARY', 'TERTIARY') as $strength) {
    echo $strength, "\r\n";
    $coll->setStrength( constant('Collator::'.$strength) );
    echo '  o ~ ö = ', $coll->compare('o', SMALL_O_WITH_DIAERESIS), "\r\n";
    echo '  Ö ~ ö = ', $coll->compare(CAP_O_WITH_DIAERESIS, SMALL_O_WITH_DIAERESIS), "\r\n";
}

prints

PRIMARY
  o ~ ö = 0
  Ö ~ ö = 0
SECONDARY
  o ~ ö = -1
  Ö ~ ö = 0
TERTIARY
  o ~ ö = -1
  Ö ~ ö = 1

On the primary level all the involved characters (o,O,ö,Ö) are just some irrelevant variations of the character O, so all are regarded as equal.
On the secondary level the additional "feature" WITH DIAERESIS is taken into consideration, and on the tertiary level also whether it is a small or a capital letter.
But ... MySQL doesn't exactly work that way ... so, sorry again ;-)
___ end of excursion ____

In MySQL there are collation tables that specify the order. When you select a charset you also implicitly select the default collation for that charset, unless you explicitly specify one. In your case the implicitly selected collation is probably utf8_general_ci, and it treats ö==o.
This applies to both the table definition and the charset/collation of the connection (the latter being almost irrelevant in your case).
utf8_turkish_ci on the other hand treats ö!=o. That's probably the collation you want.

When you have a table definition like

CREATE TABLE soFoo (
  x varchar(32)
)
CHARACTER SET utf8

the default collation for utf8 is chosen -> general_ci -> o=ö
You can specify the default collation for the table when defining it

CREATE TABLE soFoo (
  x varchar(32)
)
CHARACTER SET utf8 COLLATE utf8_turkish_ci

Since you already have a table plus data, you can change the collation of the table ... but if you do it on the table level you have to use ALTER TABLE ... CONVERT (if you use MODIFY instead, the column keeps its "original" collation).

ALTER TABLE soFoo CONVERT TO CHARACTER SET utf8 COLLATE utf8_turkish_ci

That should pretty much take care of your problem.


As a side note, there is (as mentioned) a collation assigned to your connection as well. Selecting a charset means selecting a collation. I mainly use PDO when (directly) connecting to MySQL, and my default connection code looks like this

$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'localonly', 'localonly', array(
    PDO::ATTR_EMULATE_PREPARES=>false,
    PDO::MYSQL_ATTR_DIRECT_QUERY=>false,
    PDO::ATTR_ERRMODE=>PDO::ERRMODE_EXCEPTION
));

Note the charset=utf8; no collation is given, so again general_ci is assigned to the connection. And that's why

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'localonly', 'localonly', array(
    PDO::ATTR_EMULATE_PREPARES=>false,
    PDO::MYSQL_ATTR_DIRECT_QUERY=>false,
    PDO::ATTR_ERRMODE=>PDO::ERRMODE_EXCEPTION
));

$smallodiaresis_utf8 = chr(0xC3) . chr(0xB6);
foreach( $pdo->query("SELECT 'o'='$smallodiaresis_utf8'") as $row ) {
    echo $row[0];
}

prints 1, meaning o==ö. The string literals used in the statement are treated as utf8/utf8_general_ci.

I could either specify the collation for the string literal explicitly in the statement

SELECT 'o' COLLATE utf8_turkish_ci ='ö'

(only setting it for one of the two literals/operands; for why and how this works, see Collation of Expressions)
or I can set the connection collation via

$pdo->exec("SET collation_connection='utf8_turkish_ci'");

both result in

foreach( $pdo->query("SELECT 'o'[...]='$smallodiaresis_utf8'") as $row ) {
    echo $row[0];
}

printing 0.
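Applied to the original filtering problem, the explicit COLLATE variant could look roughly like the sketch below. The table `people`, the column `name`, and both function names are hypothetical. The sketch also assumes that the filter is a prefix match, that the column's character set is utf8, and that the connection uses charset=utf8 as above; for a utf8mb4 column the collation would be utf8mb4_turkish_ci instead.

```php
<?php
// Hypothetical helper: turn user input into a LIKE prefix pattern,
// escaping the LIKE wildcards so the input is matched literally.
function likePrefixPattern(string $prefix): string
{
    return strtr($prefix, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']) . '%';
}

// Hypothetical helper: `people` and `name` are made-up identifiers.
function findNamesStartingWith(PDO $pdo, string $prefix): array
{
    // COLLATE on the column forces the Turkish collation for this one
    // comparison, so a prefix of 'Ö' should no longer match names
    // that start with 'O'.
    $stmt = $pdo->prepare(
        'SELECT name FROM people WHERE name COLLATE utf8_turkish_ci LIKE ?'
    );
    $stmt->execute([likePrefixPattern($prefix)]);

    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

It would be called as `findNamesStartingWith($pdo, $prefix)` with the $pdo connection from above. Forcing a collation in the query like this may keep MySQL from using an index on the column, so converting the table as shown earlier is likely the better long-term fix.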

edit: and to complicate things even a bit further:
The charset utf8 can't represent all possible characters. There's an even broader character set, utf8mb4.
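MySQL's utf8 stores at most three bytes per character, so anything above U+FFFF (most emoji, for example) needs utf8mb4. A minimal sketch of a check on the PHP side, with needsUtf8mb4 being a made-up helper name:

```php
<?php
// Returns true if $text contains a code point above U+FFFF, i.e. one that
// takes four bytes in UTF-8 and therefore does not fit MySQL's 3-byte utf8.
function needsUtf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needsUtf8mb4("\u{00D6}"));  // bool(false): Ö takes two bytes
var_dump(needsUtf8mb4("\u{1F600}")); // bool(true): this emoji takes four bytes
```

Turkish letters such as Ö, ş or ğ all fit in utf8, so this only matters if the names can contain characters outside the Basic Multilingual Plane.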

The PHP code should be receiving %C3%96 suitably decoded back to Ö. But if not, then apply the PHP function urldecode() to the string.

You will still have the character Ö, not O; is that OK?

If you get Ã– instead of Ö, then there is a mixture of utf8 and latin1. That is a different problem.
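A mixture of utf8 and latin1 usually shows up as mojibake. As a small illustration (assuming the mbstring extension, and using Windows-1252, which is what MySQL's latin1 corresponds to), the utf-8 bytes of Ö misread as latin1 come out as two characters:

```php
<?php
$utf8 = "\xC3\x96"; // Ö encoded as UTF-8

// Misread the two bytes as Windows-1252 and re-encode the result as UTF-8.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n";          // Ã–
echo bin2hex($mojibake), "\n"; // c383e28093

// Reversing the same conversion recovers the original bytes.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
echo bin2hex($repaired), "\n"; // c396
```

Seeing Ã– where Ö was expected is therefore a hint that some layer (connection, table, or page) is treating utf-8 bytes as latin1.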
