
List of 'messed up characters' in utf8

One of my clients has a website that has been totally messed up by the hosting company forcing a character set on the entire database. We've had trouble with character sets before, but now it's a straight-up drama!

So far I've added the charset=utf-8 to the page content type and set the charset for the mysql connection to utf8. And now it's time to replace all characters. So far what I've found is:

Ã¶ = ö
Ã« = ë
Ã© = é

The data inside the database is being updated like so:

UPDATE `table` SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');

Now I just need to find a complete list of all characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%', but this returns every record in the database.

Google only turns up a couple of characters (mostly the 3 above) in topics by other people who have had the same trouble. There doesn't seem to be a complete list of these characters anywhere (or at least of the most common ones) that I can use to find and replace all the data for my client.

If anyone knows of such a list or is able to complete mine, I will, in return, create a page containing these characters to help others (unless there's already a list somewhere that I'm not aware of, of course).

// EDIT :

It would be for the most common European characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the Ringel-S (the German sharp S, ß). Not so much for the Spanish and Portuguese characters like ñ or ã, but if they are in a list somewhere that would be much appreciated as well.
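A list like this doesn't have to be collected by hand: each broken sequence is simply the UTF-8 bytes of the real character read back in a single-byte character set. The sketch below is in Python purely for illustration (the site itself is PHP), and the function name `mojibake_map` and its `wrong_codec` parameter are hypothetical. It assumes the bytes were misread as ISO-8859-1; if they were misread as Windows-1252 instead, a few characters such as ß come out differently.

```python
def mojibake_map(chars, wrong_codec="latin-1"):
    """Map each character to how its UTF-8 bytes look when misread as wrong_codec."""
    return {ch: ch.encode("utf-8").decode(wrong_codec) for ch in chars}


if __name__ == "__main__":
    # Characters from the question; extend the string as needed.
    for good, bad in mojibake_map("éèëáàäöóòïüñã").items():
        print(f"{bad!r} = {good}")
    # Assumption: ß is shown with Windows-1252, where byte 0x9F is a printable character.
    print(mojibake_map("ß", wrong_codec="cp1252"))
```

The resulting pairs could feed the REPLACE() queries, but they only fit if the data was damaged exactly once and in exactly this way, so it is worth checking a few real rows first.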

// EDIT 2 :

I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet . I DID NOT make use of the mb_ functions so far and, as far as I can tell, haven't done any mb configuration.

The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts though; I'm not sure that's needed, but it won't hurt). The files are all saved as UTF-8 without BOM. PHPFreakMailer has also been updated by setting its charset to utf-8.

Sadly, I'm still seeing these weird characters. I didn't expect them to go away by themselves, but it was worth hoping :-) So what's the final step I should take? Continue using the REPLACE query and change all the weird characters manually?

Thanks in advance!

This is a bit crazy; what character set do you think "Ã¶" is in?

It looks like that's actually a correct UTF-8 sequence (since it's two bytes); you're just displaying it as ISO-8859-1.

Edit :

Based on your comment I think the following is going on:

I think (but I'm really not 100% sure) that the correct UTF-8 byte sequence is stored in the database. But the table is marked as ISO-8859-1, and you asked MySQL to convert the character set automatically. So MySQL treats the bytes as ISO-8859-1 (which looks like Ã¶) and then converts that to UTF-8.

You should be able to verify this: check whether strlen() of the string that shows up as 'Ã¶' is 4, and not 2. If the length is indeed 2, the problem is in how your browser decodes the page.

To fix this, don't tell MySQL to convert the characters.

Option 2

The data could also be 'double encoded' in the table. To check this, also check the string length in the database. If the value that should be a single 'ö' is 4 bytes long, this is the issue.
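The 2-versus-4 byte check can be reproduced outside the database. This is a minimal sketch in Python for illustration only; `utf8_len` and `double_encode` are hypothetical helper names, and `double_encode` assumes the second, unwanted conversion treated the bytes as ISO-8859-1.

```python
def utf8_len(text):
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


def double_encode(text):
    """Simulate converting already-UTF-8 bytes to UTF-8 again, as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    print(utf8_len("ö"))                 # correctly stored character
    print(utf8_len(double_encode("ö")))  # double-encoded character
```

A correctly stored 'ö' takes 2 bytes (C3 B6). After one extra conversion each of those bytes becomes its own 2-byte character, giving 4 bytes in total.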

My advice in this case is to not try to build a big 'messed up character' map. You should simply be able to utf8_decode() the string. Normally this function outputs an ISO-8859-1 string, but in your case it should turn out to be the original, valid UTF-8 string.

I hope this works!

Edit2

Ok so effectively what I believe has happened is Option 2. To put it in simple (php) terms:

$output = utf8_encode(utf8_encode('string'));

So one utf8_decode() should be enough.

Do test this before you run your migration scripts though :)
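To make the "one decode is enough" idea concrete, here is a rough Python equivalent, shown for illustration rather than as the PHP migration code. The name `undo_double_encoding` is hypothetical. It assumes the damage is exactly one extra ISO-8859-1 to UTF-8 conversion, and it leaves a value untouched when it does not look double-encoded, so rows that are already correct are not damaged by the repair.

```python
def undo_double_encoding(text):
    """Reverse one extra ISO-8859-1 -> UTF-8 conversion; return text unchanged if it does not apply."""
    try:
        # Turn each character back into the single byte it came from,
        # then read the resulting byte string as the UTF-8 it originally was.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1 bytes, or the bytes are not valid UTF-8:
        # most likely the value was never double-encoded.
        return text


if __name__ == "__main__":
    for sample in ["Ã¶", "ö", "hello"]:
        print(repr(sample), "->", repr(undo_double_encoding(sample)))
```

Plain ASCII passes through unchanged, and a correctly stored 'ö' is rejected by the UTF-8 decoding step and returned as-is, which is the behaviour you want when only part of the table is affected.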

If they forced a character set change, why is your database not converted? Are your tables still in the old character set (see the table information in phpMyAdmin)?

Is the data wrong when it shows up in phpMyAdmin, or only on your webpage? -> Your names and collation should change, as well as the headers and file type (save the file as UTF-8).

Or try:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

I would start replacing characters only if there are no options from within MySQL left.

Since you've tagged this question with "php", I assume you read the database and its values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.

The better solution would be to fix the inconsistency between the data and the tables' character set. Back up the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset; you'll have to do this per column.

Why don't you use: ä = &auml;, ö = &ouml;, ...

Run htmlentities() in PHP and it will convert all special characters into entities.
I think this would be the easiest way to do it.


 