Why is iconv generating an illegal character error?
I'm trying to iron out the warnings and notices from a script. The script includes the following:
$clean_string = iconv('UTF-8', 'UTF-8//IGNORE', $supplier.' => '.$product_name);
As I understand it, the original author intended this line to remove non-UTF-8 characters from the string, but obviously any non-UTF-8 characters in the input cause iconv to throw an illegal character warning.
To solve this, my idea was to do something like the following:
$clean_string = iconv(mb_detect_encoding($supplier.' => '.$product_name), 'UTF-8//IGNORE', $supplier.' => '.$product_name);
Oddly, however, mb_detect_encoding() is returning UTF-8 as the detected encoding!
The letter e with an accent (é) is an example of a character that causes this behaviour.
I realise I'm mixing multibyte libraries between detection and conversion, but I couldn't find an encoding detection function in the iconv library.
I've considered using the mb_convert_encoding() function to clean the string up into UTF-8, but the PHP documentation isn't clear about what happens to characters that cannot be represented.
I am using PHP 5.2.17 with the glibc iconv implementation, library version 2.5.
Can anyone offer any suggestions on how to clean the string into UTF-8, or insight into why this behaviour occurs?
Your example:
$string = $supplier . ' => ' . $product_name;
$stringUtf8 = iconv('UTF-8', 'UTF-8//IGNORE', $string);
might work for you on PHP 5.2. In later PHP versions, if the input is not strictly valid UTF-8, iconv will drop the whole string (you will get an empty string). That is just a note, in case you were not aware of it.
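Because the outcome depends on the PHP and iconv versions, it can help to check the return value instead of trusting it. The following is only a minimal sketch: `iconv_clean_utf8` is a hypothetical helper name, and treating an empty result for non-empty input as a failure is an assumption based on the behaviour described above.

```php
<?php
// Hypothetical helper: returns the cleaned string, or null if iconv rejected the input.
function iconv_clean_utf8($input)
{
    // '@' hides the notice iconv raises on an illegal byte sequence.
    $result = @iconv('UTF-8', 'UTF-8//IGNORE', $input);

    // false: iconv failed outright. An empty result for non-empty input is
    // treated as a failure as well (an assumption; input consisting only of
    // invalid bytes would also end up in this branch).
    if ($result === false || ($result === '' && $input !== '')) {
        return null;
    }
    return $result;
}

$clean = iconv_clean_utf8('ACME => Widget');
```

A null return then tells the caller that the input needs another treatment, rather than silently continuing with an empty string.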
Then you try mb_detect_encoding (Docs) to find out the original encoding:
$string = $supplier . ' => ' . $product_name;
$encoding = mb_detect_encoding($string);
$stringUtf8 = iconv($encoding, 'UTF-8//IGNORE', $string);
As I already linked in a comment, mb_detect_encoding does some magic and cannot work reliably. It tries to help you, but it cannot detect the encoding very well. That is in the nature of the problem: many byte sequences are valid in more than one encoding. You can try setting strict mode to true:
$order = mb_detect_order();
$encoding = mb_detect_encoding($string, $order, true);
if (FALSE === $encoding) {
    throw new UnexpectedValueException(
        sprintf(
            'Unable to detect input encoding with mb_detect_encoding, order was: %s',
            print_r($order, true)
        )
    );
}
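If detection stays unreliable, one alternative (a different technique from detection, not something the script's author used) is to validate instead of guess: check whether the bytes are valid UTF-8 and convert from a single assumed legacy encoding only when they are not. This is a rough sketch; `to_utf8` is a hypothetical helper, and the ISO-8859-1 fallback is purely an assumption about where the data comes from. Scalar type hints are left out so the code also runs on old PHP versions.

```php
<?php
// Hypothetical helper; the ISO-8859-1 fallback is an assumption about the data.
function to_utf8($input, $fallback = 'ISO-8859-1')
{
    // Validation is deterministic: the bytes either form valid UTF-8 or they do not.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise reinterpret the bytes in the assumed legacy encoding.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";      // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9";  // "café" encoded as UTF-8

$fromLatin1 = to_utf8($latin1);
$fromUtf8   = to_utf8($utf8);
```

With this approach an ISO-8859-1 é is converted rather than stripped, and input that is already valid UTF-8 passes through unchanged.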
In addition, you might need to translate the encoding names (Docs) between the two libraries (iconv and multibyte strings), and/or validate them against the encodings each library supports.
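Such a translation could be as simple as a lookup table. The sketch below is illustrative only: `iconv_name_for` is a hypothetical helper, and the single map entry is an assumption that should be checked against `mb_list_encodings()` and the output of `iconv -l` on the system in question.

```php
<?php
// Illustrative, incomplete map from mbstring names to iconv names.
// The entry is an assumption; verify it against your own iconv build.
$mbToIconv = array(
    'SJIS-win' => 'CP932',
);

// Hypothetical helper: returns the mapped iconv name, or the mbstring
// name unchanged when no mapping is known.
function iconv_name_for($mbName, array $map)
{
    return isset($map[$mbName]) ? $map[$mbName] : $mbName;
}

$iconvName = iconv_name_for('SJIS-win', $mbToIconv);
```

Names without an entry are passed through as-is, so common ones such as UTF-8 need no mapping.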
Hope this helps you at least better understand why some things might not work, and how you can find the error cases and filter the input with the standard PHP extensions.