Why is JSON::XS Not Generating Valid UTF-8?
I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;
BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}
foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass it strings you've already decoded, that is, character strings rather than UTF-8 encoded bytes. I'm not sure what the point of it is, but it's not the right tool here. If you want to check whether a string is valid UTF-8, you could use
ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
'$string is valid UTF-8');
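For anyone who wants to try that check outside a test file, here is a minimal self-contained sketch. The helper name is_valid_utf8 and the two sample strings are made up for illustration; only decode_utf8 and the Encode::FB_CROAK and Encode::LEAVE_SRC flags come from the snippet above.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: returns 1 if $bytes decodes cleanly as UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes malformed input die; LEAVE_SRC keeps $bytes unmodified.
        decode_utf8($bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# UTF-8 encoded bytes: the guillemets become C2 AB and C2 BB.
my $good = encode_utf8("Deliver \x{AB}French Bread\x{BB}");

# Lone Latin-1 bytes AB and BB, which are not valid UTF-8 on their own.
my $bad = "Deliver \xABFrench Bread\xBB";

print is_valid_utf8($good) ? "valid\n" : "invalid\n";   # valid
print is_valid_utf8($bad)  ? "valid\n" : "invalid\n";   # invalid
```

Note that the argument is a byte string in both cases, which is the opposite of what is_sane_utf8 expects.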
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
          +--------------------- Start of two byte sequence
          |    +---------------- Not zero (good)
          |    |    +----------- Continuation byte indicator (good)
          |    |    |
          v    v    v
C2 AB = [110]00010 [10]101011
00010 101011 = 000 1010 1011 = U+00AB = «
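The same bit arithmetic can be written out in Perl. This is only an illustrative sketch of decoding a single two-byte sequence; decode_two_byte is a hypothetical name, and real code should leave this to Encode.

```perl
use strict;
use warnings;

# Decode one two-byte UTF-8 sequence by hand (illustration only).
sub decode_two_byte {
    my ($lead, $cont) = @_;
    die "not a two-byte lead byte\n" unless ($lead & 0xE0) == 0xC0;   # 110xxxxx
    die "not a continuation byte\n"  unless ($cont & 0xC0) == 0x80;   # 10xxxxxx

    # Five payload bits from the lead byte, six from the continuation byte.
    my $cp = (($lead & 0x1F) << 6) | ($cont & 0x3F);

    # Anything below U+0080 would fit in one byte, so it would be overlong here.
    die "overlong encoding\n" if $cp < 0x80;
    return $cp;
}

printf "U+%04X\n", decode_two_byte(0xC2, 0xAB);   # U+00AB
printf "U+%04X\n", decode_two_byte(0xC2, 0xBB);   # U+00BB
```

Both results are below 256, which is why a character-oriented check can mistake the raw bytes C2 AB for two Latin-1 characters.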
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;
foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Here are the two main situations when is_sane_utf8 will fail:
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
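A small sketch of the double encoding and two ways around it. It uses JSON::PP instead of JSON::XS so that it runs with core modules only (the question notes both behave the same), and the variable names are made up for illustration. The extra encode_utf8 call stands in for what an :encoding(UTF-8) layer effectively does to bytes that are already encoded.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use JSON::PP;

# \x{AB} and \x{BB} are the guillemets, written as escapes so the
# example does not depend on the encoding of the source file.
my $data = { value => "Deliver \x{AB}French Bread\x{BB}" };

# encode_json returns UTF-8 encoded bytes: U+00AB becomes C2 AB.
my $json_bytes = encode_json($data);

# Encoding those bytes a second time turns C2 AB into C3 82 C2 AB.
my $double = encode_utf8($json_bytes);

# Option 1: decode the bytes back to characters before character-oriented IO.
my $json_chars = decode_utf8($json_bytes);

# Option 2: ask the encoder for a character string in the first place.
my $json_chars2 = JSON::PP->new->utf8(0)->encode($data);

my $at = index($json_bytes, "\xC2\xAB");   # offset of the first guillemet

binmode STDOUT, ':encoding(UTF-8)';
printf "JSON bytes:     %vX\n", substr($json_bytes, $at, 2);        # C2.AB
printf "double encoded: %vX\n", substr($double, $at, 4);            # C3.82.C2.AB
printf "decoded char:   U+%04X\n", ord(substr($json_chars, $at, 1)); # U+00AB
print  "$json_chars\n";   # a character string, safe on the encoded handle
```

Either option gives a character string that can be interpolated into test names or printed through an :encoding(UTF-8) handle. The byte string from encode_json is the one to use when writing to a raw handle or sending over the wire.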