List of 'messed up characters' in utf8

One of my clients has a website which has been totally messed up by the hosting company forcing a character set on the complete database. We've had trouble with character sets before, but now it's a straight-up drama!

So far I've added charset=utf-8 to the page content type and set the charset for the MySQL connection to utf8. Now it's time to replace all the characters. So far what I've found is:

Ã¶ = ö
Ã« = ë
Ã© = é

The data inside the database is being updated like so:

UPDATE `table` SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');

Now I just need to find a complete list of all characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%', but this returns all records in the database.

Google also just turns up a couple of characters (mostly the 3 above) in topics from other people who have had the same trouble. There seems to be no complete list of these characters anywhere (or at least of the most common ones) which I can use to find and replace all the data for my client.

If anyone knows of such a list or is able to complete mine, I will, in return, create a page containing these characters to help others (unless there's already a list somewhere which I'm not aware of, of course).

// EDIT :

It would be for the most common European characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the ringel-S (German double S, ß). Not so much for the Spanish characters like ñ or ã, but if they are in a list somewhere that would be much appreciated as well.
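For reference, pairs like the ones listed earlier can be generated rather than collected by hand, assuming the damage really is UTF-8 bytes being read back as a single-byte Western encoding. The sketch below is in Python, not the PHP/MySQL used on the site; the `mojibake` helper and the `CHARS` string are made up for illustration, and "latin-1" is an assumption (a setup that actually uses Windows-1252 would give a different result for ß).

```python
# Characters named in the question; extend the string as needed.
CHARS = "éèëáàäöóòïüßñã"


def mojibake(ch: str, wrong: str = "latin-1") -> str:
    """Return how the UTF-8 bytes of `ch` look when misread as `wrong`.

    `wrong` is an assumption about the misapplied encoding; try "cp1252"
    if the output does not match what the page shows.
    """
    return ch.encode("utf-8").decode(wrong)


# Map each broken sequence back to the character it came from.
table = {mojibake(ch): ch for ch in CHARS}

for broken, fixed in table.items():
    # repr() makes invisible bytes (such as the no-break space in the
    # broken form of à) visible in the output.
    print(f"{broken!r} = {fixed!r}")
```

Each key of `table` could then be fed into a REPLACE query like the one above, although a single conversion of the whole column is usually less error-prone than character-by-character replacement.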

// EDIT 2 :

I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet . I have NOT made use of the mb_ functions so far and, as far as I can tell, haven't done any MB configuration.

The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts though; not sure if that's needed, but it won't do any harm). The files are all saved as UTF-8 without BOM. PHPFreakMailer has also been updated by setting its charset to utf-8.

Unfortunately, I'm still getting these weird characters. I wasn't expecting them to go away by themselves, but it was worth hoping :-) So what's the final step I should take? Keep using the REPLACE query and change all the weird characters manually?

Thanks in advance!

This is a bit crazy; what character set do you think "Ã¶" is in?

It looks like that's actually a correct UTF-8 sequence (since it's two bytes); you're just displaying it as ISO-8859-1.

Edit :

Based on your comment, I think the following is going on:

I think (but I'm really not 100% sure) that the correct UTF-8 binary sequence is stored in the database. But since the table is marked as ISO-8859-1 and you requested automatic character set conversion, MySQL thinks the data is ISO-8859-1 (which looks like Ã¶) and then tries to convert that to UTF-8.

You should be able to verify this by checking whether strlen() of the fetched value is 4, and not 2. If the length is indeed 2, your browser encoding is somehow screwing up.
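The 2-versus-4 byte check can be reproduced outside PHP. This is only a sketch in Python under the assumption described above (UTF-8 bytes misread as ISO-8859-1 and then encoded to UTF-8 a second time); the variable names are illustrative.

```python
original = "ö"

# Correctly encoded once: two bytes, C3 B6.
once = original.encode("utf-8")

# Misread those two bytes as ISO-8859-1 (giving the two characters "Ã¶"),
# then encode that text as UTF-8 again: four bytes, C3 83 C2 B6.
twice = once.decode("latin-1").encode("utf-8")

print(len(once), once.hex())
print(len(twice), twice.hex())
```

A stored length of 4 bytes for a single ö therefore points at a second encoding pass rather than at a display problem.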

To fix this, don't set MySQL up to re-encode the characters.

Option 2

The data could also be 'double encoded' in the table. To check this, simply check the string length in the database as well. If the stored 'ö' is 4 bytes long, this is the issue.

My advice in this case is to not try to make a big 'messed up character' map. You should simply be able to utf8_decode() the string. Normally this function outputs an ISO-8859-1 string, but in your case it should turn out to be the original valid UTF-8 string.
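The same idea, sketched in Python rather than PHP: reversing one encoding pass means turning the text back into ISO-8859-1 bytes and reading those bytes as UTF-8. The function name `undo_double_encoding` is hypothetical, and the fallback of returning the input unchanged is a design choice for this sketch, not something the answer prescribes.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one extra UTF-8 encoding pass, if there was one.

    Assumes the extra pass treated the bytes as ISO-8859-1. Strings that
    do not fit that pattern are returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1, or the
        # resulting bytes are not valid UTF-8: leave it alone.
        return text


print(undo_double_encoding("Ã¶"))   # a double-encoded ö
print(undo_double_encoding("ö"))    # already clean
```

Note that a string which legitimately contains the two characters "Ã¶" cannot be told apart from a damaged one, so it is worth running this on a copy of the data first.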

I hope this works!

Edit2

OK, so effectively what I believe has happened is Option 2. To put it in simple (PHP) terms:

$output = utf8_encode(utf8_encode('string'));

So one utf8_decode() should be enough.

Do test this before you run your migration scripts though :)

If they forced a character set change, why is your database not converted? Are your tables still in the old character set (see the table information in phpMyAdmin)?

Is the data wrong when it shows up in phpMyAdmin, or only on your webpage? -> your names and collation should change, as well as headers and file type (save the file as utf-8).

Or try:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

I would start replacing characters only if there are no options left within MySQL.

Since you've tagged this question with "php", I assume you read the database and its values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.

The better solution would be to fix the inconsistency between the data and the tables' character set. Back up the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset; you'll have to do this per column.

Why don't you use: ä = &auml;, ö = &ouml;, ...

Do htmlentities(); in PHP and it will convert all special characters into entities.
I think this would be the easiest way to do it.
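To show what that conversion produces, here is a rough Python analogue. It is not PHP's htmlentities() and, unlike that function, it leaves <, > and & untouched; the helper name `to_entities` is made up for this sketch. It only gives sensible results when the input text is correctly decoded to begin with.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with HTML entities.

    Uses a named entity where the standard table has one and a numeric
    reference otherwise. ASCII characters are passed through as-is.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")
    return "".join(out)


print(to_entities("Köln, Zoë"))
```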
