The most performant way to check if a string is blank (i.e. only contains whitespace) in JavaScript?
I need to write a function that tests whether a given string is "blank", in the sense that it contains only whitespace characters. Whitespace characters are the following:
'\u0009',
'\u000A',
'\u000B',
'\u000C',
'\u000D',
' ',
'\u0085',
'\u00A0',
'\u1680',
'\u180E',
'\u2000',
'\u2001',
'\u2002',
'\u2003',
'\u2004',
'\u2005',
'\u2006',
'\u2007',
'\u2008',
'\u2009',
'\u200A',
'\u2028',
'\u2029',
'\u202F',
'\u205F',
'\u3000'
The function will be called many times, so it must be really, really performant, but it shouldn't take too much memory (like mapping every character to true/false in an array). Things I've tried out so far:
Checking every character against a hash set (if (!whitespaceCharactersMap[str[index]]) ...) - works well enough.
My current solution uses hardcoded comparisons:
function isBlank(str) {
    var length = str.length;
    if (!length) {
        return true;
    }
    for (var index = 0; index < length; index++) {
        var c = str[index];
        if (c === ' ') {
            // skip
        } else if (c > '\u000D' && c < '\u0085') {
            return false;
        } else if (c < '\u00A0') {
            if (c < '\u0009') {
                return false;
            } else if (c > '\u0085') {
                return false;
            }
        } else if (c > '\u00A0') {
            if (c < '\u2028') {
                if (c < '\u180E') {
                    if (c < '\u1680') {
                        return false;
                    } else if (c > '\u1680') {
                        return false;
                    }
                } else if (c > '\u180E') {
                    if (c < '\u2000') {
                        return false;
                    } else if (c > '\u200A') {
                        return false;
                    }
                }
            } else if (c > '\u2029') {
                if (c < '\u205F') {
                    if (c < '\u202F') {
                        return false;
                    } else if (c > '\u202F') {
                        return false;
                    }
                } else if (c > '\u205F') {
                    if (c < '\u3000') {
                        return false;
                    } else if (c > '\u3000') {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
This seems to work 50-100% faster than the hash set (tested on Chrome).
Does anybody see or know of further options?
Update 1
I'll answer some of the comments here:
Now here's my take on performance tests:
http://jsperf.com/hash-with-comparisons/6
I'd be grateful if you guys could run these tests a couple of times.
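For anyone who prefers to time the variants locally, a small harness can help. This is only a rough sketch: the names timeIt and sampleInputs are made up for illustration, Date.now() is coarse, and the numbers will differ between engines, so treat them as relative at best.

```javascript
// Hypothetical helper: calls fn on every input, `rounds` times over,
// and reports the elapsed wall-clock time in milliseconds.
function timeIt(fn, inputs, rounds) {
    var blankCount = 0; // accumulate results so the calls are not dead code
    var start = Date.now();
    for (var r = 0; r < rounds; r++) {
        for (var i = 0; i < inputs.length; i++) {
            if (fn(inputs[i])) blankCount++;
        }
    }
    return { elapsedMs: Date.now() - start, blankCount: blankCount };
}

// Candidate under test: a regexp-based check.
function isBlankRegExp(str) {
    return !/[^\s]/.test(str);
}

// Assumed sample data: four blank strings and two non-blank ones.
var sampleInputs = ['', '   ', '\t\n', 'abc', '  x  ', '\u2003\u3000'];
var result = timeIt(isBlankRegExp, sampleInputs, 1000);
```

The blankCount field doubles as a sanity check: two candidates that report different counts on the same inputs do not implement the same test.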
Preliminary conclusions:
The branch-free solution (a^9*a^10*a^11...) is extremely fast in Chrome and Firefox, but not in Safari. Probably the best choice for Node.js from a performance perspective.
(... ' ' should be the first one).
To sum up, for my case I'll opt for the following regexp version:
var re = /[^\s]/;
return !re.test(str);
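One caveat worth checking: JavaScript's \s is defined by the language specification, not by the list in this question. It does not match '\u0085', it additionally matches '\uFEFF', and whether it matches '\u180E' depends on the Unicode version the engine implements. If the exact list matters, a sketch with an explicit character class (the names NON_BLANK and isBlankExact are just for illustration) could look like this:

```javascript
// Matches any single character that is NOT in the question's whitespace list.
var NON_BLANK = /[^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]/;

function isBlankExact(str) {
    return !NON_BLANK.test(str);
}

isBlankExact('');                 // true: nothing outside the list
isBlankExact(' \u0085\u180E\t');  // true: every character is in the list
isBlankExact(' \uFEFF ');         // false: U+FEFF is not in the list
isBlankExact(' a ');              // false
```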
Reasons:
The hard-coded solution seems the best, but I think switch should be faster. It depends on how the JavaScript interpreter handles these (most compilers do this very efficiently), so it may be browser-specific (i.e., fast in some, slow in others). Also, I'm not sure how fast JavaScript is with UTF strings, so you might try converting each character to its integer code before comparing the values.
for (var index = 0; index < length; index++)
{
var c = str.charCodeAt(index);
switch (c) {
case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D: case 0x0020:
case 0x0085: case 0x00A0: case 0x1680: case 0x180E: case 0x2000: case 0x2001:
case 0x2002: case 0x2003: case 0x2004: case 0x2005: case 0x2006: case 0x2007:
case 0x2008: case 0x2009: case 0x200A: case 0x2028: case 0x2029: case 0x202F:
case 0x205F: case 0x3000: continue;
}
return false;
}
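Wrapped in a complete function, the loop above might look like the following sketch. The name isBlankSwitch is an assumption; the case labels are the 26 code points from the question.

```javascript
// Sketch: the switch-based loop wrapped in a function.
function isBlankSwitch(str) {
    for (var index = 0, length = str.length; index < length; index++) {
        switch (str.charCodeAt(index)) {
            case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D: case 0x0020:
            case 0x0085: case 0x00A0: case 0x1680: case 0x180E: case 0x2000: case 0x2001:
            case 0x2002: case 0x2003: case 0x2004: case 0x2005: case 0x2006: case 0x2007:
            case 0x2008: case 0x2009: case 0x200A: case 0x2028: case 0x2029: case 0x202F:
            case 0x205F: case 0x3000:
                continue; // whitespace: go on to the next character of the for loop
        }
        return false; // any other character means the string is not blank
    }
    return true; // empty string, or whitespace only
}
```

Note that continue inside the switch applies to the enclosing for loop, which is what lets the return false after the switch act as the default case.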
Another thing to consider is changing the for loop:
for (var index in str)
{
...
}
Edit
Your jsPerf test got some revisions; the current one is available here. My code is significantly faster in Chrome 26 and 27, and in IE10, but it's also the slowest one in Firefox 18.
I ran the same test (I don't know how to make jsPerf save those) on Firefox 20.0 on 64-bit Linux and it turned out to be one of the two fastest ones (tied with trimTest, both at about 11.8M ops/sec). I also tested Firefox 20.0.1 on WinXP, but under VirtualBox (still on 64-bit Linux, which might make a significant difference here), which gave 10M ops/sec to switchTest, with trimTest coming second at 7.3M ops/sec.
So, I'm guessing that the performance depends on the browser version and/or maybe even on the underlying OS/hardware (I suppose the above FF18 test was on Windows). In any case, to make a truly optimal version, you'd have to write many versions, test each on all the browsers, OSes, and architectures you can get hold of, and then include in your page the version best suited to the visitor's browser, OS, and architecture. I'm not sure what kind of code is worth the trouble, though.
Since branching is much more expensive than most other operations, you want to keep branches to a minimum. Thus, your sequence of if/else statements may not be very performant. A method that instead uses mostly math would be a lot faster. For example:
One way of performing an equality check without any branching is to use bitwise operations. For example, to check that a == b:
(a ^ b) == 0
The parentheses matter, because in JavaScript == binds more tightly than ^. Since the xor of two identical bits (i.e., 1 ^ 1 or 0 ^ 0) is 0, xor-ing two equal values produces 0. This is useful because it allows us to treat 0 as a "true" value and do more math. Imagine that we have a bunch of boolean variables represented in this way: nonzero numbers are false, and zero means true. If we want to ask, "is any of these true?", we simply multiply them all together. If any of them were true (equal to zero), the entire result would be zero.
So, for example, the code would look something like this:
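function(str) {
    for (var i = 0; i < str.length; i++) {
        // Use the numeric code unit: xor on one-character strings would
        // first coerce them to numbers, which gives 0 for most characters.
        var c = str.charCodeAt(i);
        if ((c ^ 0x0009) * (c ^ 0x000A) * (c ^ 0x000B) ... == 0)
            continue;
        return false;
    }
    return true;
}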
The primary reason this would be more performant than simply doing something like:
if ((c == '\u0009') || (c == '\u000A') || (c == '\u000B') ...)
is that JavaScript has short-circuit boolean operators, meaning that every time the || operator is used, it not only performs the or operation, but also checks whether the result is already known to be true, which is a branching operation, and branching is expensive. The math approach, on the other hand, involves no branching except for the if statement itself, and should thus be much faster.
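Spelled out for all 26 characters from the question, the xor-and-multiply idea could be sketched as below. The name isBlankXor is an assumption, and the code works on numeric code units from charCodeAt, since xor on one-character strings would coerce them to numbers first. Whether it actually beats the alternatives depends on the engine, so it is worth measuring.

```javascript
// Sketch of the xor-and-multiply check; one branch per character.
function isBlankXor(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        var c = str.charCodeAt(i); // UTF-16 code unit, 0..0xFFFF
        // Each factor is 0 exactly when c equals that whitespace code,
        // so the product is 0 exactly when c is one of the listed codes.
        var product =
            (c ^ 0x0009) * (c ^ 0x000A) * (c ^ 0x000B) * (c ^ 0x000C) *
            (c ^ 0x000D) * (c ^ 0x0020) * (c ^ 0x0085) * (c ^ 0x00A0) *
            (c ^ 0x1680) * (c ^ 0x180E) * (c ^ 0x2000) * (c ^ 0x2001) *
            (c ^ 0x2002) * (c ^ 0x2003) * (c ^ 0x2004) * (c ^ 0x2005) *
            (c ^ 0x2006) * (c ^ 0x2007) * (c ^ 0x2008) * (c ^ 0x2009) *
            (c ^ 0x200A) * (c ^ 0x2028) * (c ^ 0x2029) * (c ^ 0x202F) *
            (c ^ 0x205F) * (c ^ 0x3000);
        if (product !== 0) return false; // c is not in the list
    }
    return true;
}
```

Each factor is a non-negative integer below 65536, so the product of 26 nonzero factors stays well inside the range of a double and cannot collapse to zero by accident.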
This creates and uses a 'hash' lookup on the characters of the string; if it detects a non-whitespace character, it returns false:
var wsList=['\u0009','\u000A','\u000B','\u000C','\u000D',' ','\u0085','\u00A0','\u1680','\u180E','\u2000','\u2001','\u2002','\u2003','\u2004','\u2005','\u2006','\u2007','\u2008','\u2009','\u200A','\u2028','\u2029','\u202F','\u205F','\u3000'];
var ws=Object.create(null);
wsList.forEach(function(char){ws[char]=true});
function isWhitespace(txt){
for(var i=0, l=txt.length; i<l; ++i){
if(!ws[txt[i]])return false;
}
return true;
}
var test1=" \u1680 \u000B \u2002 \u2004";
isWhitespace(test1);
/*
true
*/
var test2=" _ . a ";
isWhitespace(test2);
/*
false
*/
Not sure about its performance (yet). After a quick test on jsperf, it turns out to be quite slow compared to a RegExp using /^\s*$/.
edit:
It appears that the solution you should go with likely depends on the nature of the data you are working with: is the data mostly whitespace or mostly non-whitespace? Is it mostly ASCII-range text? You might be able to speed it up for average test cases by using range checks (via if) for common non-whitespace character ranges, using switch on the most common whitespace, then using a hash lookup for everything else. This will likely improve the average performance if most of the data being tested consists of the most common characters (between 0x0 and 0x7F).
Maybe something like this (a hybrid of if/switch/hash) could work:
/*same setup as above with variable ws being a hash lookup*/
function isWhitespaceHybrid(txt){
for(var i=0, l=txt.length; i<l; ++i){
var cc=txt.charCodeAt(i)
//above space, below DEL
if(cc>0x20 && cc<0x7F)return false;
//switch only the most common whitespace
switch(cc){
case 0x20:
case 0x9:
case 0xA:
case 0xD:
continue;
}
//everything else use a somewhat slow hash lookup (execute for non-ascii range text)
if(!ws[txt[i]])return false;
}
return true;
}
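With several variants around, it may be worth checking that each one agrees with the whitespace list before comparing speed. The sketch below tests a candidate on every single UTF-16 code unit; findMismatches and isListedWhitespace are hypothetical helper names, and isWhitespace is the hash-lookup function from this answer, repeated so the block is self-contained.

```javascript
var wsList = ['\u0009', '\u000A', '\u000B', '\u000C', '\u000D', ' ', '\u0085', '\u00A0',
    '\u1680', '\u180E', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005',
    '\u2006', '\u2007', '\u2008', '\u2009', '\u200A', '\u2028', '\u2029', '\u202F',
    '\u205F', '\u3000'];

// Hash lookup, same setup as in the answer above.
var ws = Object.create(null);
wsList.forEach(function (ch) { ws[ch] = true; });

function isWhitespace(txt) {
    for (var i = 0, l = txt.length; i < l; ++i) {
        if (!ws[txt[i]]) return false;
    }
    return true;
}

// Reference answer for a single character: is it in the list?
function isListedWhitespace(ch) {
    return wsList.indexOf(ch) !== -1;
}

// Returns the code units on which `candidate` disagrees with the list.
function findMismatches(candidate) {
    var mismatches = [];
    for (var code = 0; code <= 0xFFFF; code++) {
        var ch = String.fromCharCode(code);
        if (candidate(ch) !== isListedWhitespace(ch)) {
            mismatches.push(code);
        }
    }
    return mismatches;
}

// An empty array means the candidate agrees with the list on every code unit.
var hashMismatches = findMismatches(isWhitespace);
```

This only covers one-character strings, so it checks the per-character classification and not the loop logic; a few multi-character and empty-string cases are still needed on top.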