How to handle user input with invalid UTF-8 characters?

Question

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my webapp uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back."

How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

EDIT: I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP". I'd like advice from people with experience in real-world situations on how they've handled this.

EDIT2: As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

Answer 1

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...

What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv() you also have the option to transliterate bad chars.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
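The question's EDIT2 asks for invalid sequences to become U+FFFD rather than being dropped. A minimal sketch of one way to do that, assuming the mbstring extension is available; the function name utf8_substitute is hypothetical and this swaps in mb_convert_encoding() for the iconv() calls above:

```php
<?php
// Hypothetical helper: replace each invalid byte sequence with U+FFFD
// instead of silently dropping it. Assumes the mbstring extension.
function utf8_substitute(string $str): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character(0xFFFD);         // substitute with U+FFFD
    // Converting UTF-8 to UTF-8 forces mbstring to validate every sequence.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $clean;
}
```

Valid input passes through unchanged, so the helper can be applied to every incoming string. How many U+FFFD characters a longer malformed sequence produces may differ between PHP versions.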

If you want to display an error message to your users, I'd probably do this in a global way instead of on a per-value basis. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
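Request arrays can be nested (for example ?tags[]=a&tags[]=b), in which case a flat array_map() would hand an array to the cleaning function. A hedged sketch of a recursive variant follows; the name utf8_clean_array is hypothetical, and it uses mbstring with the substitute character set to 'none' to drop invalid bytes, which is an assumption swapped in for the iconv() //IGNORE call:

```php
<?php
// Hypothetical helper: clean every string value in a possibly nested
// input array and report whether anything had to be changed.
// Array keys are left untouched.
function utf8_clean_array(array $input, bool &$changed = false): array
{
    $changed = false;
    $previous = mb_substitute_character();
    mb_substitute_character('none');         // drop invalid bytes entirely

    array_walk_recursive($input, function (&$value) use (&$changed) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
            $changed = true;
        }
    });

    mb_substitute_character($previous);      // restore the setting
    return $input;
}
```

A caller could then write $_GET = utf8_clean_array($_GET, $changed); and set the error message only when $changed is true, which avoids comparing serialized copies.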

You may also want to normalize new lines and strip (non-)visible control chars, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Code to convert from UTF-8 to Unicode codepoints:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

Probably faster than any other alternative, though I haven't tested it extensively.

Example:

$string = "\xEF\xBF\xBEhello world\xEF\xBF\xBD"; // U+FFFE, "hello world", U+FFFD written as byte escapes

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

Is this what you were looking for?

Answer 2

Receiving invalid characters from your web app might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You also might want to take a look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g. those in the column to the right. However, I think that signaling an error to the user is better than trying to clean up those invalid characters, since cleaning can cause unexpected loss of significant data or unexpected changes to your user's input.
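Signaling an error while still letting the user see what they typed could look roughly like the sketch below. Both function names are hypothetical, and it assumes mbstring and flat (non-nested) form fields. The ENT_SUBSTITUTE flag makes htmlspecialchars() emit U+FFFD for invalid sequences instead of returning an empty string, so the rest of the text is not lost when the form is redisplayed:

```php
<?php
// Hypothetical helper: return the names of top-level fields whose value
// is not valid UTF-8, so an error can be shown next to each of them.
function invalid_utf8_fields(array $fields): array
{
    $bad = [];
    foreach ($fields as $name => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }
    return $bad;
}

// Hypothetical helper: escape a value for redisplay in the form.
// Invalid sequences become U+FFFD; valid text is kept as typed.
function redisplay(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}
```

The original request data stays untouched, so nothing is stored until the user has corrected and resubmitted the flagged fields.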

Answer 3

I put together a fairly simple class to check if input is in UTF-8 and to run it through utf8_encode() as needed:

class utf8
{

    /**
     * Recursively run utf8_encode() on every value that is not valid UTF-8.
     *
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } else {
                if (false === self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
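        // preg_match() returns 0 (not false) when there is no match,
        // so compare against 1 to get a real boolean.
        return 1 === preg_match('%^(?: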
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);

Answer 4

There is a multibyte extension for PHP, check it out: http://www.php.net/manual/en/book.mbstring.php

You should try the mb_check_encoding() function.
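A short sketch of what that check looks like, together with the json_encode() failure mentioned in the question. The wrapper name json_encode_lenient is hypothetical, and the JSON_INVALID_UTF8_SUBSTITUTE flag it relies on requires PHP 7.2 or later:

```php
<?php
// Hypothetical wrapper: encode to JSON, substituting U+FFFD for invalid
// UTF-8 instead of failing. JSON_INVALID_UTF8_SUBSTITUTE needs PHP 7.2+.
function json_encode_lenient($value)
{
    return json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);
}

$valid   = "caf\xC3\xA9"; // "café" as UTF-8
$invalid = "caf\xE9";     // "café" as ISO-8859-1, which is not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)
var_dump(json_encode($invalid));                // bool(false)
echo json_encode_lenient($invalid), "\n";       // "caf\ufffd"
```

Checking with mb_check_encoding() at input time and using the lenient flag only for legacy rows keeps new data strict while old data stays readable.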

Good luck!

Answer 5

I recommend simply not allowing garbage to get in. Don't rely on custom functions, which can bog your system down. Instead, walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters. The data you store in your database is then data triggered by the user, but not actually user-supplied data.

EDIT #4: Replacing bad characters with an entity:

EDIT #3: Updated: Sept 22 2010 @ 1:32pm. Reason: the returned string is now UTF-8, plus I used the test file you provided as proof.

<?php
// build alphabet
// optionally you can remove characters from this array

$alpha[]= chr(0); // null
$alpha[]= chr(9); // tab
$alpha[]= chr(10); // new line
$alpha[]= chr(11); // vertical tab
$alpha[]= chr(13); // carriage return

// printable ASCII, from space (32) to tilde (126)
for ($i = 32; $i <= 126; $i++) {
    $alpha[]= chr($i);
}

/* remove comment to check ascii ordinals */

// /*
// foreach ($alpha as $key=>$val){
//  print ord($val);
//  print '<br/>';
// }
// print '<hr/>';
//*/
// 
// //test case #1
// 
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv   '.chr(160).chr(127).chr(126);
// 
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// //test case #2
// 
// $str = ''.'©?™???';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';
// 
// $str = '©';
// $string = teststr($alpha,$str);
// print $string;
// print '<hr/>';

$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10),file($file));

$string = teststr($alpha,$testfile);
print $string;
print '<hr/>';


function teststr(&$alpha, &$str){
    $strlen = strlen($str);
    $newstr = chr(0); //null
    $x = 0;
    if($strlen >= 2){

        for ($i = 0; $i < $strlen; $i++) {
            $x++;
            if(in_array($str[$i],$alpha)){
                // passed
                $newstr .= $str[$i];
            }else{
                // failed
                print 'Found out of scope character. (ASCII: '.ord($str[$i]).')';
                print '<br/>';
                $newstr .= '&#65533;';
            }
        }
    }elseif($strlen <= 0){
        // failed to qualify for test
        print 'Non-existent.';

    }elseif($strlen === 1){
        $x++;
        if(in_array($str,$alpha)){
            // passed

            $newstr = $str;
        }else{
            // failed
            print 'Total character failed to qualify.';
            $newstr = '&#65533;';
        }
    }else{
        print 'Non-existent (scope).';
    }

    if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8"){
        // skip
    }else{
        $newstr = utf8_encode($newstr);
    }

    // test encoding:
    if(mb_detect_encoding($newstr, "UTF-8")=="UTF-8"){
        print 'UTF-8 :D<br/>';
    }else{
        print 'ENCODED: '.mb_detect_encoding($newstr, "UTF-8").'<br/>';
    }

    return $newstr.' (scope: '.$x.', '.$strlen.')';
}

Answer 6

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

Answer 7

How about stripping all chars outside your given subset? At least in some parts of my application I would not allow chars outside the [a-zA-Z] and [0-9] sets, for example in usernames. You can build a filter function that silently strips all chars outside this range, or one that returns an error if it detects them and pushes the decision to the user.
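Both variants of such a filter can be sketched in a few lines. The function names are hypothetical, and the allowed set here is assumed to be ASCII letters and digits only. The patterns deliberately omit the /u modifier so they work byte by byte and never fail on malformed UTF-8:

```php
<?php
// Hypothetical validator: accept only non-empty strings made of ASCII
// letters and digits, so the caller can return an error otherwise.
function is_valid_username(string $name): bool
{
    return preg_match('/\A[A-Za-z0-9]+\z/', $name) === 1;
}

// Hypothetical filter: silently strip every byte outside the allowed set.
function strip_username(string $name): string
{
    return (string) preg_replace('/[^A-Za-z0-9]/', '', $name);
}
```

The validator suits the "return an error and let the user decide" approach, while the filter implements the silent stripping.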

Answer 8

Try doing what Rails does to force all browsers always to post UTF-8 data:

<form accept-charset="UTF-8" action="#{action}" method="post"><div
    style="margin:0;padding:0;display:inline">
    <input name="utf8" type="hidden" value="&#x2713;" />
  </div>
  <!-- form fields -->
</form>

See railssnowman.info or the initial patch for an explanation.

To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is IE and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).
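On the PHP side, the hidden field can also double as a quick sanity check. This is an assumption layered on top of the Rails trick rather than part of it, and the function name posted_as_utf8 is hypothetical: if the field does not arrive as the three UTF-8 bytes of U+2713, the form was probably not submitted as UTF-8.

```php
<?php
// Hypothetical check: the hidden "utf8" field from the form above should
// arrive as U+2713 (bytes E2 9C 93) when the browser posted UTF-8.
function posted_as_utf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === "\xE2\x9C\x93";
}
```

A request handler might call posted_as_utf8($_POST) first and fall back to an error page or an encoding conversion when it returns false.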

Answer 9

Set UTF-8 as the character set for all headers output by your PHP code.

In every PHP output header, specify UTF-8 as the encoding:

header('Content-Type: text/html; charset=utf-8');

9 answers

Answer 1: 60, accepted, 2010-09-18 18:16:07
Answer 2: 4, 2010-09-15 06:56:44
Answer 3: 2, 2010-09-21 16:03:33
Answer 4: 1, 2010-09-15 06:50:03
Answer 5: 1, 2010-09-20 13:49:19
Answer 6: 1, 2010-09-25 01:24:49
Answer 7: 0, 2010-09-15 07:07:12
Answer 8: 0, 2010-09-15 15:04:18
Answer 9: 0, 2018-07-03 08:26:02
