UTF-8, PHP and XML with MySQL

I am having a lot of trouble solving this one:

I have a MySQL database with the collation latin1_swedish_ci and a table that stores names and addresses.

I am trying to output a UTF-8 XML file, but I am having problems with the following string:

Otivägen is output as OtivÃ¤gen when I open the file in vim. Also, when I open it in IE I get:

"An invalid character was found in text content. Error processing resource"

I have the following code:

function fixEncoding($in_str)
{
    $cur_encoding = mb_detect_encoding($in_str) ;
    if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
        return $in_str;
    else
        return utf8_encode($in_str);
}

header("Content-type: text/plain;charset=utf-8");
$mystring = "Otivägen"; // this is actually obtained from the database

$myxml = "<myxml>
....
     <node>".$mystring."</node>
....
</myxml>
";
$myxml = fixEncoding($myxml);

The actual XML output is below:

<?xml version="1.0" encoding="UTF-8" ?>
<myxml>
    ....
    <node>OtivÃ¤gen</node>
    ....
</myxml>

Any ideas how I can output the file so that in vim it reads Otivägen and not OtivÃ¤gen?

EDIT:

I called mysql_client_encoding() and got latin1.
I then called mysql_set_charset()
and ran mysql_client_encoding() again and got utf8, but I still have the same output issues.

Edit 2

I logged into the MySQL command line and ran the following query:

SELECT address1 FROM address WHERE id = 1000;
Current database: ftpuser_db

+-------------+
|   address1  |
+-------------+
| Otivägen 32 |
+-------------+
1 row in set (0.06 sec)

Thanks in advance!

Is your MySQL connection encoding properly set to UTF-8?

Check mysql_set_charset() and mysql_client_encoding() for more details.

Oh boy. UTF8 issues can be a real pain, and they get almost impossible to solve when something is doing re-encodings for you.

You really need to start at one end and make sure every process is UTF8. That stops anything along the way from misinterpreting the data and 'converting' it for you. Just as importantly, it makes it much easier to spot when something has already mis-encoded text for you (yes, I've had that problem).

And if you have UTF8 data in tables that aren't set to UTF8 and might be mis-encoded, you need to convert the tables last, after the data has been re-encoded. Otherwise you will damage your data irretrievably. I've had that problem, too.

First steps:

  • Check that your terminal is UTF8 compliant. Gnome-terminal is. Kterm is. ETerm is not.
  • Check the LANG setting in your shell. It should probably have .UTF-8 on the end of its value.
  • Check that vim is picking up the UTF8 setting correctly. You can check with :set encoding

This will mean that your files will be edited in UTF8.

Now we check MySQL.

In the MySQL CLI, run show variables like 'character_set%'; The results will probably be something like:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

What you're aiming for is to change all those latin1 values (or whatever you're seeing) to utf8.

set names utf8; will change most of them, and you might need to run it on every new connection to your database. This was the solution I had to adopt in a previous application. The other settings to change are in the my.cnf file, for which I need to direct you to the documentation. It is unlikely you will need to set them all.
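The per-connection step could be sketched in PHP roughly as follows. This is only a sketch under assumptions: it uses the mysqli extension rather than the mysql_* functions from the question, and the helper name openUtf8Connection and its parameters are hypothetical. It calls mysqli's set_charset() instead of sending SET NAMES by hand, so that the client library also knows which charset is in use.

```php
<?php
// Hypothetical helper: open a connection and switch it to UTF-8 immediately.
// Host, user, password and database name are placeholders supplied by the caller.
function openUtf8Connection(string $host, string $user, string $pass, string $dbName): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbName);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }

    // Sets the client, connection and results charsets, like SET NAMES does.
    // 'utf8' matches the answer above; 'utf8mb4' is MySQL's full 4-byte UTF-8.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }

    return $db;
}
```

After calling the helper, $db->character_set_name() should report the charset that is actually in effect, which is a quick way to confirm the change took hold.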

I see you're already setting the output headers, so that's good.

Now you can look at the data from the database and see why it's "wrong".

I think you did everything correctly, except that your terminal is in Latin-1.

The UTF-8 sequence for ä is C3 A4, which is Ã¤ if displayed as Latin-1.
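A small PHP sketch of that byte-level effect, assuming the mbstring extension is available. The string is written with byte escapes so the result does not depend on how this source file itself is saved; the variable names are illustrative only.

```php
<?php
// "Otivägen" as UTF-8 bytes: ä is the two bytes C3 A4.
$utf8 = "Otiv\xC3\xA4gen";
echo bin2hex("\xC3\xA4"), "\n"; // c3a4

// Simulate a Latin-1 reader: treat each byte as one Latin-1 character
// and re-encode the result as UTF-8 so it can be displayed.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n"; // OtivÃ¤gen on a UTF-8 terminal
```

The bytes in the file are the same in both cases; only the interpretation of C3 A4 differs.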

latin1_swedish_ci is a collation, not a charset. Since collations are supposed to match their charset, it suggests that the table is using latin1, but that's not a guarantee.

Strictly speaking, the charset of the tables is irrelevant here, since MySQL can convert input and output. That's what the connection charset (mysql_set_charset) is for. However, for that to work properly, the data needs to be encoded properly in the database. I would begin by checking that the strings are correct in the database. The simplest thing is to log in on the command line and select a row which has non-ASCII characters in it. Does it look OK?

$mystring = "Otivägen"; // this is actually obtained from the database

Watch out. The encoding of the data in $mystring will now depend on the encoding of the PHP file. That may or may not be the same as the encoding of the data in the database.

Before output, run the query SET NAMES utf8

After output, you can go back by running SET NAMES latin1

Look here, I've got the same problem

It seems you are "double encoding" Otivägen. You get this behaviour if Otivägen is already UTF-8 and you run utf8_encode() on it again. Example:

$str = "Otivägen"; // already a UTF-8 string
echo utf8_encode($str); // outputs OtivÃ¤gen

I'm not sure where the actual "double encoding" occurs, but it may be due to settings in your editor. My theory: let's say you are running Aptana Studio and your actual character set is set to ISO-8859-1 (in Aptana, you can check this by right-clicking on a file and choosing "Properties"; to set the default character encoding for all projects, choose Preferences from the Aptana main menu -> General -> Workspace). If that's the case, the actual PHP source file where you have $myxml and its string <myxml><node>... is detected as ISO-8859-1, but $mystring received from the database is UTF-8. Your fixEncoding function would then run the else clause, since $myxml as a whole is seen as ISO-8859-1 and not UTF-8. This double-encodes the results from the database, and may be the cause of your problem.

Check the encoding of your actual source file in your editor, and verify that it is set to UTF-8. Alternatively, experiment with applying or removing fixEncoding/utf8_encode/utf8_decode on $myxml. Observe the results and see what needs to be done to get the value Otivägen right.
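One way to experiment is to convert each database value on its own rather than the whole document. The following is a minimal sketch, not the asker's code: toUtf8 is a hypothetical replacement for fixEncoding, it assumes the mbstring extension, and it assumes that any value which is not valid UTF-8 is ISO-8859-1. The two sample strings stand in for database values and are written with byte escapes.

```php
<?php
// Hypothetical replacement for fixEncoding(): convert only when the bytes
// are not already valid UTF-8, so UTF-8 input is never encoded twice.
function toUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1 (Latin-1).
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Stand-ins for values read from the database.
$fromLatin1 = "Otiv\xE4gen";     // ä as the single Latin-1 byte E4
$fromUtf8   = "Otiv\xC3\xA4gen"; // ä as the UTF-8 bytes C3 A4

// Convert each value separately, escape it for XML, then build the node.
$node1 = '<node>' . htmlspecialchars(toUtf8($fromLatin1), ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</node>';
$node2 = '<node>' . htmlspecialchars(toUtf8($fromUtf8), ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</node>';

echo $node1, "\n", $node2, "\n"; // both lines contain the same UTF-8 bytes
```

The check is a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be left unconverted, so knowing the real encoding of the connection is still the more reliable fix.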
