
PHP input filtering - checking ascii vs checking utf8

I need to ensure that all my strings are UTF-8. Would it be better to check that input coming from a user is ASCII-like or that it is UTF-8-like?

//KohanaPHP
function is_ascii($str) {
    return ! preg_match('/[^\x00-\x7F]/S', $str);
}

//Wordpress
function seems_utf8($Str) {
    for ($i=0; $i<strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
            return false;
        }
    }
    return true;
}

I did some benchmarking on 100 strings (half valid UTF-8/ASCII and half not) and found that seems_utf8() takes 0.011 seconds while is_ascii() only takes 0.001 seconds. But my gut is telling me that you get what you pay for and the UTF-8 check would be the better choice.

I'm then planning on doing something like this to convert the strings that fail the check.

<?php

/* Example data */
$string[] = 'hello';
$string[] = 'asdfghjkl;qwertyuiop[]\zxcvbnm,./]12345657890-=+_)(*&^%$#@!';
$string[] = '';
$string[] = 'accentué';
$string[] = '»á½µÎ½Ï‰Î½ Ï„á½° ';
$string[] = '???R??=8 ????? ++++¦??? ???2??????';
$string[] = 'hello¦ùó 5/5¡45-52ZÜ¿»'. "0x93". octdec('77'). decbin(26). "F???pp?? ??? ". '»á½µÎ½Ï‰Î½ Ï„á½° ';


$time = microtime(true);

//Count the successes
$true = array(1 => 0, 0 => 0);

foreach($string as $s) {
    $r = seems_utf8($s);    //0.011

    //Tally the result so the counts printed at the end are meaningful
    $true[(int) $r]++;

    print_pre(mb_substr($s, 0, 30). ' is '. ($r ? 'UTF-8' : 'non-UTF-8'));


    if( ! $r ) {

        $e = mb_detect_encoding($s, "auto");

        print_pre('Encoding: '. $e);

        //Convert
        $s = iconv($e, 'UTF-8//TRANSLIT', $s);

        print_pre(mb_substr($s, 0, 30). ' is now '. (seems_utf8($s) ? 'valid' : 'not'). ' UTF-8');
    }

}

print_pre($true);
print_pre((microtime(TRUE) - $time). ' seconds');

function print_pre() { print '<pre>'; print_r(func_get_args()); print '</pre>'; }

Making the choice between ASCII and UTF-8 based on performance is probably the wrong approach. The answer really depends on your use case. If your strings need to support internationalization, you will most likely go with UTF-8. If your site is English only, you might go with ASCII. Or maybe you still go with UTF-8. Whatever you choose, it should probably match the character encoding you set for the HTML form you serve to solicit the input from your user.

I'm not sure how necessary parts of this approach are. If you ask the user for UTF-8 input and they give you "something else", throw it away and ask again.

The various character set detection functions out there are universally (and tragically, necessarily) imperfect. The ones in the MB library as well as the ones in iconv aren't even that advanced compared to some of the stuff that's out there. mb_detect_encoding() basically iterates through a list of character sets and returns the first one that makes the string it has in hand look valid. In this day and age it's probable that several would return true (which is why the ordering is exposed through mb_detect_order()).
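To make that ambiguity concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The helper name encodings_accepting() and the candidate list are made up for illustration. It only reports which encodings consider the bytes valid, which is roughly the question a detector asks for each entry in its list.

```php
<?php
// Hypothetical helper: return every candidate encoding in which $bytes is valid.
function encodings_accepting(string $bytes, array $candidates): array
{
    $accepted = [];
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            $accepted[] = $encoding;
        }
    }
    return $accepted;
}

// "café" with the é stored as the single byte 0xE9, which is not valid UTF-8.
$bytes = "caf\xE9";
$candidates = ['UTF-8', 'ISO-8859-1', 'Windows-1252'];

print_r(encodings_accepting($bytes, $candidates));
```

More than one single-byte encoding accepts these bytes, so a detector that stops at the first match is really just reporting whichever candidate came first in its list.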

Ensure your pages are served with the correct HTTP and HTML character set declarations, and browsers should return data in the same encoding. To be extra specific, include the accept-charset attribute in your form tag. I've yet to discover a case where this was ignored that didn't represent an attack.
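As a rough sketch of those three declarations together (HTTP header, HTML meta tag, and accept-charset on the form): the function name render_comment_form(), the field name "comment", and the target /submit.php are all placeholders, not anything from the question.

```php
<?php
// Hypothetical helper that builds a page whose form asks the browser for UTF-8.
function render_comment_form(string $action): string
{
    // Escape the action URL before embedding it in an attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<form method="post" action="{$action}" accept-charset="UTF-8">
<textarea name="comment"></textarea>
<button type="submit">Send</button>
</form>
</body>
</html>
HTML;
}

// Declare the charset at the HTTP level before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_comment_form('/submit.php');
```

Even with all three in place, the submitted data still has to be validated on the server, since nothing forces a client to honor them.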

To check the encoding of a byte stream, you can simply use mb_check_encoding().
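A minimal sketch of that check, assuming mbstring is available. The wrapper names are made up for illustration. The second function is an alternative that relies on PCRE's /u modifier, which makes preg_match() fail on subjects that are not valid UTF-8.

```php
<?php
// Returns true when $bytes is a well-formed UTF-8 string (requires mbstring).
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// Alternative without mbstring: with /u, preg_match() returns false
// instead of 1 when the subject is not valid UTF-8.
function is_valid_utf8_pcre(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("accentu\xC3\xA9")); // é encoded as two UTF-8 bytes
var_dump(is_valid_utf8("accentu\xE9"));     // é as a lone Latin-1 byte
```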

I'm assuming what you're doing is checking whether the iconv call seems necessary before executing it?

If you don't expect non-ASCII characters to occur very often, then is_ascii() seems like it would be the most efficient approach. The iconv call would only need to be triggered if a character above 7 bits was encountered.

If there are likely to be high-bit characters in the checked string, then seems_utf8() might be more efficient: you will need to call iconv a lot less, unless there is also a high frequency of high-bit but non-UTF-8 characters.
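The two checks can also be layered, so the cheap one runs first and iconv is only reached when both fail. The following is just a sketch under stated assumptions: ensure_utf8() is a made-up name, mb_check_encoding() stands in for seems_utf8(), and the ISO-8859-1 fallback is a guess about the source encoding that would need to match your real data.

```php
<?php
// Hypothetical helper: ASCII fast path, then UTF-8 validation, then conversion.
// $fallback is an ASSUMED source encoding for anything that is not valid UTF-8.
function ensure_utf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    // Fast path: pure 7-bit ASCII is already valid UTF-8.
    if (!preg_match('/[^\x00-\x7F]/', $input)) {
        return $input;
    }
    // High-bit bytes present: accept the string as-is if it is valid UTF-8.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Last resort: convert from the assumed fallback encoding.
    $converted = iconv($fallback, 'UTF-8', $input);
    if ($converted === false) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }
    return $converted;
}

var_dump(ensure_utf8("caf\xE9") === "caf\xC3\xA9");
```

Which check dominates the cost depends on the input mix, as described above, so it is worth timing this against your own data rather than against 100 sample strings.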

If you are just trying to protect your inputs so they accept only UTF-8, I think you can just use mb_check_encoding(). Something like this:

if (!mb_check_encoding($input, 'UTF-8')) {
  die('Non UTF-8 character found');
}

should be enough to reject any invalid input.
