
Advice on how to fix unicode, language issues in an existing database

I have a client whose database contains garbled characters. I inherited this project, and my guess is that user-entered text wasn't processed or stored correctly, whether by PHP, MySQL, or both. For example:

Ex 1: the database field ("about") has values that look like this:

Dans la nature, face au ciel, un b%uFFFDb%uFFFD qui sourit quand on lui souffle sur le visage.

The collation on this field in MySQL is currently set to : latin1_swedish_ci

Ex 2: Another field ("description") looks like this:

VidÃÆ'©o tournÃÆ'©e dans le cadre

The collation on this field in MySQL is currently set to: utf8_general_ci

Basically I have to fix all this. These examples are French but there are other records that may contain Japanese or Chinese (thus double-byte chars).

For entries like example 1, my plan is to change the field to utf8_general_ci and write a script to convert all the Unicode escape codes to the characters they represent (I'm not exactly sure how to do this latter part... ideas?).
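A minimal sketch of such a conversion script, written in Python for illustration (the function names `decode_percent_u` and `has_replacement_char` are hypothetical). It assumes the stored values use the same `%uXXXX` shape as the non-standard escapes emitted by JavaScript's `escape()`. Note that `%uFFFD` is U+FFFD, the Unicode replacement character, so in a value like the one in example 1 the original accented letter was probably lost before the text was stored, and decoding cannot bring it back.

```python
import re

# Matches "%uXXXX" escapes: a literal "%u" followed by four hex digits.
_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a decoder could not read a byte


def decode_percent_u(text):
    """Replace every %uXXXX escape with the character it names.

    Limitation of this sketch: surrogate pairs (characters outside the
    Basic Multilingual Plane) are not rejoined.
    """
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


def has_replacement_char(text):
    """True if the decoded text contains U+FFFD, i.e. data was already lost."""
    return REPLACEMENT_CHAR in text


if __name__ == "__main__":
    row = "un b%uFFFDb%uFFFD qui sourit"
    decoded = decode_percent_u(row)
    # Rows flagged here need manual correction, not just decoding.
    print(decoded, has_replacement_char(decoded))
```

Rows that still contain U+FFFD after decoding would have to be corrected by hand or from another copy of the data.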

For entries like example 2, I'm not sure what those odd characters are.

Is utf8_general_ci the collation I should be using here to support all possible languages in one database table?

Other stats:

[peter@akebono A_PSG]$ php --version
PHP 5.2.6 (cli) (built: May 8 2008 08:54:23)
Copyright (c) 1997-2008 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2008 Zend Technologies
    with Zend Debugger v5.2.14, Copyright (c) 1999-2008, by Zend Technologies

Have a look at this article on the approaches you could take: http://www.phpwact.org/php/i18n/charsets

I remember we had the same problem, but we used a mysql utility to change the encoding. I forget which now.

With PHP, you should be looking at iconv and the other character set encoding/decoding methods to detect the current encoding and change it to whatever standard you're going to go with.
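As a rough illustration of "detect, then convert", here is a sketch in Python rather than PHP (in PHP, `iconv` would do the conversion step). The helper name `to_utf8` and the choice of Windows-1252 as the fallback are assumptions, not something the question confirms. Detection is only a heuristic: bytes that decode cleanly as strict UTF-8 are treated as UTF-8, and anything else is assumed to be in the fallback encoding.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return (utf8_bytes, source_encoding) for a raw byte string.

    Assumption: data that is not valid UTF-8 was written as `fallback`.
    """
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid sequences
        return raw, "utf-8"
    except UnicodeDecodeError:
        pass

    try:
        text = raw.decode(fallback)
        used = fallback
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never raises.
        text = raw.decode("latin-1")
        used = "latin-1"
    return text.encode("utf-8"), used


if __name__ == "__main__":
    # b"Vid\xe9o" is "Vidéo" stored as a single-byte Western encoding.
    converted, source = to_utf8(b"Vid\xe9o")
    print(converted, source)
```

Returning the encoding that was guessed makes it possible to log which rows were converted and to review them before writing anything back.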

EDIT

Also, have a look at the multibyte (mb_*) functions in PHP. Start with: http://www.php.net/manual/en/function.mb-convert-encoding.php
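Text like example 2 in the question looks like UTF-8 that was read as a single-byte encoding and then re-encoded as UTF-8, possibly more than once. The sketch below (Python, with a hypothetical helper name `repair_mojibake`, and Windows-1252 assumed as the intermediate encoding) reverses that step until the text stops changing. In PHP, the equivalent single step would be a call such as `mb_convert_encoding($s, 'Windows-1252', 'UTF-8')`. The sample in the question seems to have lost some characters along the way, so it may not round-trip cleanly; test on copies of the data first.

```python
def repair_mojibake(text, max_rounds=3):
    """Undo repeated 'UTF-8 bytes misread as cp1252' damage.

    Each round re-encodes the text as cp1252 and decodes the bytes as
    UTF-8. The loop stops when that fails (the text is not mojibake, or
    no longer is) or when a round changes nothing.
    """
    for _ in range(max_rounds):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not reversible this way; keep what we have
        if candidate == text:
            break  # pure ASCII or already stable
        text = candidate
    return text


if __name__ == "__main__":
    # Simulate the damage: correct text, misread once as cp1252.
    damaged = "Vid\u00e9o tourn\u00e9e".encode("utf-8").decode("cp1252")
    print(damaged, "->", repair_mojibake(damaged))
```

Correctly stored accented text is left alone because its cp1252 bytes are not valid UTF-8, and Japanese or Chinese text is left alone because it cannot be encoded as cp1252 at all.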

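I'm not sure whether you can decode that back without losing data, but what I would suggest is to use utf8_encode() before inserting data into the database, because doing so reduces the number of problems you hit when you later get the data out of the database, for example to output it to XML.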
