JavaScript: avoiding empty strings from String.split, and regular expression precedence

Question

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge number of empty strings, which makes everything considerably slower than it could be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is alternative precedence within the regular expression itself. Let's say I am trying to parse a C-style multi-line comment, that is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but it would produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read the /* and */ delimiters, and then read all the remaining * characters as one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when I use the regular expression that should do this, I get bad results. The regular expression is: /(\/\*|\*\/|\*+)/.

The result of this expression, however, is (leaving out the empty strings): ["/*", "***", "/"]. I am guessing this is because the last alternative is greedy, so it steals the * that the closing */ needs.

The only solution I found was to add a negative lookahead, like this:

/(\/\*|\*\/|\*+(?!\/))/

This gives the expected result, but it is very slow compared to the other expression, and the difference is noticeable on big strings.

Is there a solution for either of these problems?

Answer 1

Use a lookahead to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

Or use filter(Boolean) to discard the empty strings:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
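
The two options above agree on "***", but they are not interchangeable on mixed input. A minimal sketch of the difference, using an illustrative string "a*b" that is not from the question: the lookahead only cuts in front of each *, so the star stays glued to the text after it, while the capturing group returns the star as its own element.

```javascript
// Illustrative input (an assumption, not taken from the question).
const input = "a*b";

// Zero-width lookahead: the string is cut before each "*",
// so the star remains attached to whatever follows it.
const byLookahead = input.split(/(?=\*)/);

// Capturing group plus filter(Boolean): the star is returned as a
// separate element, and the empty strings are dropped afterwards.
const byCapture = input.split(/(\*)/).filter(Boolean);

console.log(byLookahead); // [ 'a', '*b' ]
console.log(byCapture);   // [ 'a', '*', 'b' ]
```

For a tokenizer the second form is probably the one that is wanted, since each delimiter ends up in its own token. Note that filter(Boolean) still builds the empty strings first and only removes them afterwards.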

Answer 2

Generally, for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.
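
The match call above handles a single comment. One way to extend the idea to a whole input is to use match with the g flag, which returns only the matched tokens and therefore never produces empty strings. The sketch below is an assumption about how that could look, not part of the original answer: the helper names tokenize and findComments and the exact token pattern are hypothetical, and the pattern still relies on the negative lookahead from the question to keep the last * available for the closing */.

```javascript
// Hypothetical helper: split the input into tokens with match + the g flag.
// Alternatives are tried left to right, so the comment delimiters are listed
// first; the last two alternatives make sure every character lands in a token.
function tokenize(input) {
  const tokenPattern = /\/\*|\*\/|\*+(?!\/)|[^*\/]+|[*\/]/g;
  return input.match(tokenPattern) || []; // match returns null when nothing matches
}

// Hypothetical helper: grab whole comments with a lazy quantifier.
// [\s\S] is used instead of . so that the match can span line breaks.
function findComments(input) {
  return input.match(/\/\*[\s\S]*?\*\//g) || [];
}

console.log(tokenize("/****/"));
// [ '/*', '**', '*/' ]
console.log(tokenize("x /* hi */"));
// [ 'x ', '/*', ' hi ', '*/' ]
console.log(findComments("a /* one\ntwo */ b /**/"));
// [ '/* one\ntwo */', '/**/' ]
```

Because every character is covered by some alternative, joining the tokens returned by tokenize gives back the original input, which is usually what a highlighter needs.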


Answer 1
17 2013-11-11 23:42:29

Answer 2
2 2013-11-12 00:38:21
