
Why does Closure Compiler insist on adding more bytes?

If I give Closure Compiler something like this:

window.array = '0123456789'.split('');

It "compiles" it to this:

window.array="0,1,2,3,4,5,6,7,8,9".split(",");

Now as you can tell, that's bigger. Is there any reason why Closure Compiler is doing this?

I think this is what's going on, but I am by no means certain.

The code that causes the insertion of commas is tryMinimizeStringArrayLiteral in PeepholeSubstituteAlternateSyntax.java.

That method contains a list of characters that are likely to have a short Huffman code, and are therefore preferable to split on than other characters. You can see the result of this if you try something like this:

"a b c d e f g".split(" "); //Uncompiled, split on spaces
"a,b,c,d,e,f,g".split(","); //Compiled, split on commas (same size)

The compiler replaces the character you split on with one it considers more favourable. It does so by going through its list of preferred delimiters and picking the first one that does not occur in any of the strings being joined:

// These delimiters are chars that appears a lot in the program therefore
// probably have a small Huffman encoding.
NEXT_DELIMITER: for (char delimiter : new char[]{',', ' ', ';', '{', '}'}) {
  for (String cur : strings) {
    if (cur.indexOf(delimiter) != -1) {
      continue NEXT_DELIMITER;
    }
  }
  String template = Joiner.on(delimiter).join(strings);
  //...
}

In the above snippet you can see the array of characters the compiler considers best to split on. The comma is first, which is why the spaces in my example above were replaced by commas.

I believe the insertion of commas in the case where the string to split on is the empty string may simply be an oversight. There does not appear to be any special treatment of this case, so it's treated like any other split call and each character is joined with the first appropriate character from the array shown in the above snippet.


Another example of how the compiler deals with the split method:

"a,;b;c;d;e;f;g".split(";"); //Uncompiled, split on semi-colons
"a, b c d e f g".split(" "); //Compiled, split on spaces

This time, since the original string already contains a comma (and we don't want to split on the comma character), the comma can't be chosen from the array of characters with short Huffman codes, so the next best choice (the space) is selected.
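The selection logic described above can be sketched in JavaScript. This is only an illustration of the idea, not the compiler's own code: pickDelimiter and rewriteSplit are hypothetical names, and the delimiter list is copied from the Java snippet above.

```javascript
// Candidate delimiters, in the order the Java snippet above tries them.
const DELIMITERS = [",", " ", ";", "{", "}"];

// Returns the first candidate that occurs in none of the strings, or null.
function pickDelimiter(strings) {
  for (const delimiter of DELIMITERS) {
    if (strings.every((s) => !s.includes(delimiter))) return delimiter;
  }
  return null;
}

// Hypothetical helper: evaluates source.split(separator) and re-emits it
// as a split() call around the preferred delimiter.
function rewriteSplit(source, separator) {
  const strings = source.split(separator);
  const delimiter = pickDelimiter(strings);
  if (delimiter === null) return null; // no safe delimiter available
  return (
    JSON.stringify(strings.join(delimiter)) +
    ".split(" + JSON.stringify(delimiter) + ")"
  );
}

console.log(rewriteSplit("a b c d e f g", " "));  // "a,b,c,d,e,f,g".split(",")
console.log(rewriteSplit("a,;b;c;d;e;f;g", ";")); // "a, b c d e f g".split(" ")
console.log(rewriteSplit("0123456789", ""));      // "0,1,2,3,4,5,6,7,8,9".split(",")
```

The last call reproduces the case from the question: nothing treats the empty separator specially, so the ten characters are simply joined with the first delimiter that is free.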


Update

Following some further research into this, it is definitely not a bug. This behaviour is actually by design, and in my opinion it's a very clever little optimisation, when you bear in mind that the Closure compiler tends to favour the speed of the compiled code over size.

Above I mentioned Huffman encoding a couple of times. The Huffman coding algorithm, explained very simply, assigns a weight to each character appearing in the text to be encoded. The weight is based on the frequency with which each character appears. These frequencies are used to build a binary tree, with the most common characters closest to the root. That means the most common characters get the shortest codes, so they take up fewer bits in the compressed output.

Huffman coding is a large part of the DEFLATE algorithm used by gzip, so if your web server is configured to use gzip, your users will benefit from this clever optimisation.
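To make the Huffman argument concrete, here is a minimal sketch that computes how many bits each character would get in a plain Huffman code. It is illustrative only: huffmanCodeLengths is a hypothetical name, and real DEFLATE also applies LZ77 matching and uses length-limited canonical codes, so actual gzip output will differ.

```javascript
// Returns a Map from character to Huffman code length (in bits) for `text`.
function huffmanCodeLengths(text) {
  // Count how often each character occurs.
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);

  // Each node records its total weight and the characters (leaves) under it.
  const nodes = [...freq].map(([ch, weight]) => ({ weight, chars: [ch] }));
  const lengths = new Map([...freq.keys()].map((ch) => [ch, 0]));

  // A text with a single distinct character still needs one bit per symbol.
  if (nodes.length === 1) lengths.set(nodes[0].chars[0], 1);

  // Repeatedly merge the two lightest nodes; every merge pushes the
  // characters beneath them one level deeper, i.e. one bit longer.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [x, y] = nodes.splice(0, 2);
    const chars = [...x.chars, ...y.chars];
    for (const ch of chars) lengths.set(ch, lengths.get(ch) + 1);
    nodes.push({ weight: x.weight + y.weight, chars });
  }
  return lengths;
}

const lengths = huffmanCodeLengths("aaaabbc");
console.log(lengths.get("a"), lengths.get("b"), lengths.get("c")); // 1 2 2
```

Run over typical JavaScript source, a character as common as the comma ends up with one of the shortest codes, which is the reasoning behind preferring it as a delimiter.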

This issue was fixed on 20 April 2012; see the revision: https://code.google.com/p/closure-compiler/source/detail?r=1267364f742588a835d78808d0eef8c9f8ba8161

Ironically, split in the compiled code has nothing to do with split in the source. Consider:

Source  : a = ["0","1","2","3","4","5"]
Compiled: a="0,1,2,3,4,5".split(",")

Here, split is just a way to represent long arrays (long enough for the sum of all quotes and commas to be longer than split(",") ). So, what's going on in your example? First, the compiler sees a string function applied to a constant and evaluates it right away:

'0123456789'.split('') => ["0","1","2","3","4","5","6","7","8","9"]

At some later point, when generating output, the compiler considers this array to be "long" and writes it in the above "split" form:

["0","1","2","3","4","5","6","7","8","9"] => "0,1,2,3,4,5,6,7,8,9".split(",")

Note that all information about split('') in the source is already lost at this point.

If the source string were shorter, it would be generated in the array literal form, without extra splitting:

Source  : a = '0123'.split('')
Compiled: a=["0","1","2","3"]
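The two steps described above (fold the constant split call, then print the resulting array in whichever form is shorter) can be sketched as follows. This is a rough model under stated assumptions, not the compiler's actual heuristic: emitStringArray is a hypothetical name, and only the comma is considered as a delimiter.

```javascript
// Hypothetical emitter: prints an array of strings as a literal or as a
// "...".split(",") call, whichever is shorter.
function emitStringArray(strings) {
  const literal = "[" + strings.map((s) => JSON.stringify(s)).join(",") + "]";
  // Assumption: only the comma is tried. An empty array, or any element
  // containing a comma, keeps the literal form.
  if (strings.length === 0 || strings.some((s) => s.includes(","))) {
    return literal;
  }
  const splitForm = JSON.stringify(strings.join(",")) + '.split(",")';
  return splitForm.length < literal.length ? splitForm : literal;
}

// Step 1: the constant call is evaluated; the original separator is gone.
const folded = "0123456789".split("");

// Step 2: the array is printed in the shorter form.
console.log(emitStringArray(folded));            // "0,1,2,3,4,5,6,7,8,9".split(",")
console.log(emitStringArray("0123".split(""))); // ["0","1","2","3"]
```

For single-character elements the literal costs 4n+1 characters and the split form 2n+12, so under this model the split form wins from six elements upwards, which matches the six-element and four-element examples above.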
