为什么JSON :: XS不生成有效的UTF-8？

Question

I'm getting some corrupted JSON and I've reduced it down to this test case. 我收到了一些损坏的JSON，我把它简化为这个测试用例。

use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;

BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}
diag ord('»');

done_testing;

And this is the output: 这是输出：

utf8.t .. 
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver Â«French BreadÂ»"}

#   Failed test 'JSON: {"value":"Deliver Â«French BreadÂ»"}'
#   at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}    
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187

So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. 所以包含guillemets（«»）的字符串是有效的UTF-8，但生成的JSON不是。 What am I missing? 我错过了什么？ The utf8 pragma is correctly marking my source. utf8 pragma正确标记了我的源代码。 Further, that trailing 187 is from the diag. 此外，尾随187来自诊断。 That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. 这不到255，所以它几乎看起来像Perl中旧的Unicode bug的变种。 (And the test output still looks like crap. Never could quite get that right with Test::Builder). （并且测试输出仍然看起来像废话。使用Test :: Builder永远不能完全正确）。

Switching to JSON::PP produces the same output. 切换到JSON::PP会产生相同的输出。

This is Perl 5.18.1 running on OS X Yosemite. 这是在OS X Yosemite上运行的Perl 5.18.1。

Answer 1

is_sane_utf8 doesn't do what you think it does. is_sane_utf8不会按照您的想法执行操作。 You're suppose to pass strings you've decoded to it. 您可能希望将已解码的字符串传递给它。 I'm not sure what's the point of it, but it's not the right tool. 我不确定它的重点是什么，但它不是正确的工具。 If you want to check if a string is valid UTF-8, you could use 如果要检查字符串是否有效UTF-8，可以使用

ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
   '$string is valid UTF-8');

To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged. 为了表明JSON :: XS是正确的，让我们看一下标记的is_sane_utf8序列。

          +--------------------- Start of two byte sequence
          |    +---------------- Not zero (good)     
          |    |     +---------- Continuation byte indicator (good)
          |    |     |
          v    v     v
C2 AB = [110]00010 [10]101011

             00010     101011 = 000 1010 1011 = U+00AB = «

The following shows that JSON::XS produces the same output as Encode.pm: 以下显示JSON :: XS生成与Encode.pm相同的输出：

use utf8;
use 5.18.0;
use JSON::XS;
use Encode;

foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));

    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}

Output (with some spaces added): 输出（添加了一些空格）：

Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input:                     44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D

Input: U+65E5.672C.56FD
UTF-8 of input:                     E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D

Answer 2

JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings. JSON :: XS生成有效的UTF-8，但是您在两个不同的上下文中使用生成的UTF-8编码的字节字符串，这些字符串需要字符串。

Issue 1: Test::utf8 问题1：测试:: utf8

Here are the two main situations when is_sane_utf8 will fail: 以下是is_sane_utf8失败时的两种主要情况：

You have a miscoded character string that had been decoded from a UTF-8 byte string as if it were Latin-1 or from double encoded UTF-8, or the character string is perfectly fine and looks like a potentially "dodgy" miscoding (using the terminology from its docs). 您有一个错误编码的字符串，它已经从UTF-8字节字符串解码，就像它是Latin-1或双重编码的UTF-8，或者字符串完全正常，看起来像一个潜在的 “狡猾”错误编码（使用来自其文档的术语）。
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread» . 您有一个有效的 UTF-8字节字符串，其中包含编码的代码点U + 0080到U + 00FF，例如«French Bread» 。

The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives. is_sane_utf8测试仅适用于字符串，并且具有记录的漏报可能性。

Issue 2: Output Encoding 问题2：输出编码

All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. 所有非JSON字符串都是字符串，而JSON字符串是UTF-8编码的字节字符串，从JSON编码器返回。 Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. 由于您使用:encoding(UTF-8) PerlIO层进行TAP输出，因此字符串被隐式编码为UTF-8，结果良好，而包含JSON的字节字符串则被双重编码。 STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warn ings since they're already encoded and being passed straight out. 然而，STDERR没有:encoding PerlIO层集，因此编码的JSON字节字符串在您的warn看起来很好，因为它们已经被编码并直接传递出去。

Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder. 仅对带有字符串的IO使用:encoding(UTF-8) PerlIO层，而不是默认从JSON编码器返回的UTF-8编码字节字符串。

为什么JSON :: XS不生成有效的UTF-8？

问题描述

2 个解决方案

解决方案1
13 2014-12-06 20:36:47

解决方案2
4 2014-12-06 22:33:50

Issue 1: Test::utf8 问题1：测试:: utf8

Issue 2: Output Encoding 问题2：输出编码

为什么JSON :: XS不生成有效的UTF-8？

问题描述

2 个解决方案

解决方案1 13 2014-12-06 20:36:47

解决方案2 4 2014-12-06 22:33:50

Issue 1: Test::utf8 问题1：测试:: utf8

Issue 2: Output Encoding 问题2：输出编码

解决方案1
13 2014-12-06 20:36:47

解决方案2
4 2014-12-06 22:33:50