
Character encoding fail: why does \xBD display improperly in PHP + HTML?

I'm just trying to understand character encoding a bit better, so I'm doing a few tests.

I have a PHP file that is saved as UTF-8 and looks like this:

<?php
declare(encoding='UTF-8');

header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>

<html>

<head>
    <meta charset="UTF-8" />
    <title>Test</title>
</head>

<body>
    <?php echo "\xBD"; # Does not work ?>
    <?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>

</html>

The page itself shows this:

[Image: screenshot of the rendered page]

The gist of the problem is that my web application has a bunch of character encoding problems: people copy and paste from Outlook or Word, and the characters get transformed into the diamond question marks (do those have a real name?).

I'm trying to learn how to make sure all my input is converted to UTF-8 when the page loads (basically $_GET, $_POST, and $_REQUEST), and that all output is done using proper UTF-8 handling methods.


My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?

0xBD on its own is not valid UTF-8: it is a continuation byte and cannot start a character. If you want to encode "½" (U+00BD) in UTF-8 then you need to use the two bytes 0xC2 0xBD instead.

>>> print(b'\xc2\xbd'.decode('utf-8'))
½
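To check the claim about 0xBD for yourself, here is a minimal Python 3 sketch. The helper name `is_valid_utf8` is made up for illustration; it only relies on the fact that strict UTF-8 decoding raises `UnicodeDecodeError` on malformed input.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xBD is a continuation byte with no lead byte before it.
lone_byte_ok = is_valid_utf8(b"\xbd")
# 0xC2 0xBD is the two-byte UTF-8 sequence for U+00BD.
two_byte_ok = is_valid_utf8(b"\xc2\xbd")
# Encoding the character shows which bytes UTF-8 actually uses for it.
encoded_half = "\u00bd".encode("utf-8")

print(lone_byte_ok, two_byte_ok, encoded_half)  # False True b'\xc2\xbd'
```

The same kind of check could be run on incoming request data before it is stored or echoed back.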

If you want to use text from another charset (Latin-1 in this case, where the single byte 0xBD means "½"), then you need to transcode it to UTF-8 first, for example with iconv() or mb_convert_encoding().
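The transcoding step might look something like the following Python 3 sketch. The function name `to_utf8` and the Latin-1 fallback are assumptions for illustration, not something from the question; in PHP the rough equivalents would be `mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1')` or `iconv('ISO-8859-1', 'UTF-8', $s)`.

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw re-encoded as UTF-8 (hypothetical helper).

    Input that is already valid UTF-8 is passed through unchanged;
    anything else is assumed to be in the fallback charset.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is Latin-1. Every byte string
        # decodes as Latin-1, so this branch never raises, but it can
        # silently guess wrong if the real source charset is different.
        text = raw.decode(fallback)
    return text.encode("utf-8")


print(to_utf8(b"\xbd"))      # b'\xc2\xbd'
print(to_utf8(b"\xc2\xbd"))  # b'\xc2\xbd'
```

Trying UTF-8 first avoids double-encoding input that is already correct. The fallback is only a guess, so it is safer to know the real source charset whenever that is possible.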

Also:

$ charinfo �
U+FFFD REPLACEMENT CHARACTER
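If you don't have a `charinfo` tool at hand, a small Python 3 sketch using only the standard library shows roughly the same thing, and also where the character comes from: decoding the invalid byte in "replace" mode, which is similar to what a browser does when it renders the page.

```python
import unicodedata

# Decode the invalid byte leniently: malformed input is replaced
# with U+FFFD instead of raising an error.
decoded = b"\xbd".decode("utf-8", errors="replace")

print(f"U+{ord(decoded):04X} {unicodedata.name(decoded)}")
# U+FFFD REPLACEMENT CHARACTER
```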

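\xBD is invalid as UTF-8; what you want is \xC2\xBD. The question mark is what applications substitute for invalid byte sequences, so if you see it in your UTF-8 text, the text is either not UTF-8 or has been corrupted.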
