
Encoding in UTF-8 from PHP

I am not that good with encoding, and I am falling over even with the basics here.

I am trying to create a file that is recognised as UTF-8:

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();

I also tried:

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();

I then open the file with Notepad++ and it says the current encoding is ANSI, not UTF-8. What am I missing? How should I be outputting this file?

I will eventually be outputting an XML file of products for the Affiliate Window program. Also, if it helps, my web server is CentOS, Apache2, PHP 5.2.8.

Thanks in advance for any help!

As Filip said, encoding is not an intrinsic attribute of a file; it is implicit. Unless you know which encoding a file is meant to be interpreted in, there is no way to determine it for certain. The best you can do is guess, and that is presumably what programs such as Notepad++ do. Since the bytes you sent can be interpreted validly in many different encodings, Notepad++ just picks the candidate it likes best. For Notepad++ this appears to be ANSI (which is itself a rather inaccurate label), while other programs might default to something else.
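To make the guessing concrete, here is a rough sketch of the kind of heuristic an editor might apply. It is not how Notepad++ actually works; `guess_encoding` is a hypothetical helper and its labels are made up for illustration.

```php
<?php
// Hypothetical helper: a rough guess at how raw bytes might be classified.
function guess_encoding($bytes)
{
    // A UTF-8 byte order mark at the start is the only explicit hint.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8 with BOM';
    }
    // Bytes below 0x80 mean the same thing in ASCII, ISO-8859-1,
    // Windows-1252 and UTF-8, so the content alone cannot decide.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return 'ASCII only (ambiguous)';
    }
    // The /u modifier makes PCRE reject byte sequences that are not valid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return 'probably UTF-8';
    }
    return 'some 8-bit encoding';
}

echo guess_encoding("test"), "\n"; // ASCII only (ambiguous)
```

With the body "test" from the question, every byte is plain ASCII, so any guess is as good as another and an editor is free to report its default.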

The reason you have to specify the charset in an HTTP header is precisely that the file itself doesn't contain this information, so the browser needs to be told. Once you have saved the file to disk, the header is gone and the information is no longer available.

If the file you're going to serve is an XML document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. E.g., if you are using UTF-8, you should put this at the top of your document:

<?xml version="1.0" encoding="utf-8" ?>
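A minimal sketch of serving such a document from PHP might look like the following. The function `build_product_feed`, the element names and the filename are placeholders, not the Affiliate Window schema, and `ENT_XML1` assumes PHP 5.4 or later (newer than the 5.2.8 mentioned in the question).

```php
<?php
// Hypothetical feed builder; element names are placeholders only.
function build_product_feed(array $products)
{
    // The declaration travels with the file, unlike the HTTP header.
    $xml = '<?xml version="1.0" encoding="utf-8"?>' . "\n";
    $xml .= "<products>\n";
    foreach ($products as $product) {
        // Escape markup characters; the input is assumed to be UTF-8 already.
        $name = htmlspecialchars($product['name'], ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= "  <product><name>{$name}</name></product>\n";
    }
    $xml .= "</products>\n";
    return $xml;
}

// "Café & Bar" written out as UTF-8 bytes.
$feed = build_product_feed(array(array('name' => "Caf\xC3\xA9 & Bar")));

// Send the download headers only when running under a web server.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: application/xml; charset=utf-8');
    header('Content-Disposition: attachment; filename=products.xml');
    echo $feed;
}
```

Building the string first and sending it afterwards keeps the headers ahead of any output, which `header()` requires.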

Note that apart from getting the meta-information about the charset across, you also need to make sure that the data you are serving is actually UTF-8 encoded. This is much the same scenario: you need to know implicitly what encoding your data is in. The function utf8_encode is (despite the name) explicitly meant for converting ISO-8859-1 into UTF-8. Thus, if you use it on data that is already UTF-8 encoded, you'll double-encode it and end up with garbled text.
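The double-encoding effect can be traced byte by byte. The sketch below uses a hypothetical stand-in, `latin1_to_utf8`, which mirrors the documented ISO-8859-1 to UTF-8 conversion that utf8_encode performs, so it runs without depending on that function.

```php
<?php
// Hypothetical stand-in for utf8_encode(): treat every input byte as one
// ISO-8859-1 character and write that character as UTF-8.
function latin1_to_utf8($bytes)
{
    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($bytes[$i]);
        if ($byte < 0x80) {
            $out .= chr($byte);                 // ASCII stays a single byte
        } else {
            $out .= chr(0xC0 | ($byte >> 6))    // lead byte
                  . chr(0x80 | ($byte & 0x3F)); // continuation byte
        }
    }
    return $out;
}

$latin1 = "\xE9";                  // "é" in ISO-8859-1
$once   = latin1_to_utf8($latin1); // correct UTF-8 for "é"
$twice  = latin1_to_utf8($once);   // converted again by mistake

echo bin2hex($once), "\n";  // c3a9
echo bin2hex($twice), "\n"; // c383c2a9
```

The second pass treats each of the two UTF-8 bytes as a separate ISO-8859-1 character, which is why a viewer shows "Ã©" instead of "é". Plain ASCII such as "test" passes through unchanged, so the call in the question has no effect either way.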

Charsets aren't that complicated in themselves. The problem is that if you aren't careful about keeping things straight, you'll mess it up. Whenever you have a string, you should be absolutely certain that you know which encoding it is in. Otherwise it's not a string - it's just a blob of binary data.

test is all ASCII. So there is no need to use UTF-8 for that.

In fact, the first 128 characters of Unicode are the same as the ASCII character set, and UTF-8 encodes those characters with the same single bytes as ASCII does. See Wikipedia's description of UTF-8 for further information.
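A quick way to see this is to look at the raw bytes. In the sketch below the euro sign is only an arbitrary example of a character outside ASCII; PHP's `strlen` counts bytes, not characters.

```php
<?php
$ascii = "test";
$euro  = "\xE2\x82\xAC"; // the euro sign, U+20AC, written as UTF-8 bytes

echo bin2hex($ascii), "\n"; // 74657374
echo strlen($ascii), "\n";  // 4
echo strlen($euro), "\n";   // 3
```

All four bytes of "test" are below 0x80, so the file is byte-for-byte identical whether it is called ASCII, ANSI or UTF-8.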

Once you download the file, it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There is a thing called the byte order mark (BOM), a short prefix at the start of the contents that identifies which UTF encoding is in use.

See the question "When a BOM is used, is it only in 16-bit Unicode text?".

I would imagine that using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognize the file correctly.
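Applied to the download from the question, that idea could be sketched as follows. `with_utf8_bom` is a hypothetical helper, and whether a given editor or feed consumer accepts a BOM is an assumption worth checking.

```php
<?php
// Hypothetical helper: prefix a UTF-8 body with the three-byte BOM.
function with_utf8_bom($body)
{
    $bom = "\xEF\xBB\xBF";
    // Avoid adding a second BOM if one is already present.
    if (strncmp($body, $bom, 3) === 0) {
        return $body;
    }
    return $bom . $body;
}

$payload = with_utf8_bom("test");

// Send the download headers only when running under a web server.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: text/plain; charset=utf-8');
    header('Content-Disposition: attachment; filename=test.txt');
    echo $payload;
}
```

Note the double-quoted string: PHP only expands `\xEF` style escapes inside double quotes, so single quotes would write the backslashes literally.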

Downloaded .txt files carry no headers. Since you are going to create XML files in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and saving and opening that. It should then work, as long as the OS has UTF-8 support, which any modern Linux distribution should have.
