MySQL, UTF-8 and Emoji characters

I'm working on an iOS app with a PHP+MySQL backend. The app has a chat section, which needs to support emoji. My tables are utf8_unicode_ci. If I don't call 'set names utf8' in my scripts, emoji actually work: whatever is entered in the database is returned to the clients as it should be.

The problem is that this (if I understand it correctly) stores special characters incorrectly in the database, and this breaks string comparison (i.e. ï is no longer treated the same as i when comparing strings).

However, if I do call 'set names utf8', suddenly the emoji characters are inserted as a bunch of question marks.

Any suggestions on the proper way of handling this? Thanks!

The issue is whether the db has a diacritic-insensitive compare. The other issue is composed characters: ï can be expressed either as one Unicode code point (precomposed, U+00EF) or as two, a base letter i followed by a combining diaeresis (decomposed, U+0069 U+0308). Note that this is a combining sequence, not a surrogate pair. NSString has methods to convert a string to a precomposed or decomposed form: precomposedStringWith* and decomposedStringWith*.
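To make the precomposed/decomposed distinction concrete, here is a minimal sketch in Python rather than Objective-C. It assumes that precomposedStringWithCanonicalMapping corresponds to normalization form NFC and decomposedStringWithCanonicalMapping to NFD; the helper name same_text is made up for illustration.

```python
import unicodedata

precomposed = "\u00ef"   # ï as a single code point
decomposed = "i\u0308"   # i followed by COMBINING DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ even though they render identically.
print(precomposed == decomposed)
# After normalization they compare equal.
print(same_text(precomposed, decomposed))
```

Normalizing text to one form before it is stored keeps byte-wise comparisons consistent, independent of whatever collation the database applies.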

It seems that MySQL supports two forms of Unicode: ucs2 (an older form that was superseded by utf16), which is 16 bits per character, and utf8, which is up to 3 bytes per character. The bad news is that neither form supports plane 1 characters (mainly emoji), which require at least 17 bits. It looks like MySQL 5.5.3 and up also supports utf8mb4, utf16, and utf32, which support both BMP and supplementary characters (read: emoji). In that case, switching both the tables and the connection to utf8mb4 (for example 'set names utf8mb4' instead of 'set names utf8') should allow emoji to be stored. See MySQL Unicode Character Sets.
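The 3-byte limit can be checked before text reaches the database. The following is a rough sketch in Python, assuming that MySQL's 3-byte utf8 can hold exactly the BMP code points (U+0000 to U+FFFF); both function names are hypothetical.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every code point is in the BMP, i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(utf8_length("\u00c5"))       # Å, a BMP character
print(utf8_length("\U0001F631"))   # 😱, a plane 1 character
print(fits_in_three_byte_utf8("na\u00efve"))
print(fits_in_three_byte_utf8("hi \U0001F631"))
```

A check like this only detects the problem; storing the emoji still requires a character set that allows 4-byte sequences.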

Here is some code and results to demonstrate the different Unicode byte representations.
Unicode code points fit in 21 bits.
UTF32 directly represents the code points and clearly shows the decomposed sequence (base character plus combining mark).
UTF8 and UTF16 require one or more code units (8-bit and 16-bit respectively) to represent a Unicode character.

NSLog(@"character: %@", @"Å");
NSLog(@"decomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"precomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"character: %@", @"😱");
NSLog(@"dataUsingEncoding UTF8:  %@", [@"😱" dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [@"😱" dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [@"😱" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

// Some characters that UTF16 encodes as a surrogate pair have no other (decomposed) form

NSString *aReverse = [[NSString alloc] initWithBytes:"\xD8\x3C\xDD\x70\x00" length:4 encoding:NSUTF16BigEndianStringEncoding];
NSLog(@"character: %@", aReverse);
NSLog(@"dataUsingEncoding UTF8:  %@", [aReverse dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [aReverse dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [aReverse dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog output:

character: Å
decomposedStringWithCanonicalMapping UTF8:  <41cc8a>   
decomposedStringWithCanonicalMapping UTF16: <0041030a>   
decomposedStringWithCanonicalMapping UTF32: <00000041 0000030a>   

precomposedStringWithCanonicalMapping UTF8:  <c385>   
precomposedStringWithCanonicalMapping UTF16: <00c5>   
precomposedStringWithCanonicalMapping UTF32: <000000c5>   

character: 😱
dataUsingEncoding UTF8:  <f09f98b1>   
dataUsingEncoding UTF16: <d83dde31>   
dataUsingEncoding UTF32: <0001f631>   

character: 🅰
dataUsingEncoding UTF8:  <f09f85b0>
dataUsingEncoding UTF16: <d83cdd70>
dataUsingEncoding UTF32: <0001f170>
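
The UTF16 values in the output above follow from the standard surrogate pair arithmetic. As an illustrative sketch in Python (the function name to_surrogate_pair is made up here), the two 16-bit units can be derived from the code point like this:

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1F631 (😱) and U+1F170 (🅰), the two characters logged above
for cp in (0x1F631, 0x1F170):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> {high:04x} {low:04x}")
```

The results can be compared against the UTF16 lines in the NSLog output, or against Python's own "utf-16-be" encoder.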
