Optimizing flex string literal parsing

Question

I am starting to write a lexical analyzer for my programming language.

String literals in this language start with a " and end when an unescaped " is encountered. Everything inside (including newlines) is preserved, except escape sequences (the usual \n, \t, \" and so on, plus a way of escaping a character by its ASCII code, e.g. \097 or \97).

This is the code I have written so far:

%{
#include <cstdlib>
#include <iostream>
#include <string>
#define YY_DECL extern "C" int yylex()

std::string buffstr;
%}
%x SSTATE
%%

\"                   {
                         buffstr.clear();
                         BEGIN(SSTATE);
                     }
<SSTATE>\\[0-9]{1,3} {
                         unsigned code = atoi(yytext + 1);
                         if (code > 255) {
                             std::cerr << "SyntaxError: decimal escape sequence larger than 255 (" << code << ')' << std::endl;
                             exit(1);
                         }
                         buffstr += code;
                     }

<SSTATE>\\a          buffstr += '\a';
<SSTATE>\\b          buffstr += '\b';
<SSTATE>\\f          buffstr += '\f';
<SSTATE>\\n          buffstr += '\n';
<SSTATE>\\r          buffstr += '\r';
<SSTATE>\\t          buffstr += '\t';
<SSTATE>\\v          buffstr += '\v';
<SSTATE>\\\\         buffstr += '\\';
<SSTATE>\\\"         buffstr += '\"';
<SSTATE>\\.          {
                         std::cerr << "SyntaxError: invalid escape sequence (" << yytext << ')' << std::endl;
                         exit(1);
                     }
<SSTATE>\"           {
                         std::cout << "Found a string: " << buffstr << std::endl;
                         BEGIN(INITIAL);
                     }
<SSTATE>.|\n         buffstr += yytext[0];   /* any other character, including a raw newline */

.                    ;

%%

int main(int argc, char** argv) {
    yylex();
}

It works perfectly, but as you can see it's not particularly optimized.

It appends to a std::string once for every character in the string literal being parsed, which is not ideal.

I wonder if there's a better way of doing it, for example storing a pointer and increasing a length, and then building the string with std::string(const char* ptr, size_t length).

Is there one? What would it be?

Answer 1

The code provided is probably fast enough for all practical purposes, and you should not worry about optimizing it until you actually observe it being a bottleneck. Lexical scans, even inefficient ones, rarely contribute significantly to compile times.

However, some optimizations are straightforward.

The easiest one is to observe that most strings do not contain escape sequences. So, applying the usual optimization technique of going for the low-hanging fruit, we start by handling strings without escape sequences in one single pattern, without even passing through the separate lexical state. [Note 1]

\"[^"\\]*\"   { yylval.str = new std::string(yytext + 1, yyleng - 2); 
                return T_STRING;
              }

(F)lex provides yyleng, which is the length of the token it found, so there is never really any reason to recompute the length with strlen. In this case, we don't want the surrounding double quotes in the string, so we select yyleng - 2 characters starting at the second character.
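To make the arithmetic concrete, here is a minimal C++ sketch outside of flex. The function name strip_quotes and its parameters are hypothetical stand-ins for yytext and yyleng; it only illustrates that the pointer-plus-length constructor copies the body of the literal in a single operation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: `text` and `len` stand in for flex's yytext and yyleng
// for a token that matched \"[^"\\]*\" (so it includes both quotes).
std::string strip_quotes(const char* text, std::size_t len) {
    assert(len >= 2 && text[0] == '"' && text[len - 1] == '"');
    // Skip the opening quote and copy len - 2 characters, which leaves out
    // the closing quote as well. No strlen call is needed.
    return std::string(text + 1, len - 2);
}
```

Because the length is passed explicitly, the body may contain any character, including newlines, without affecting the copy.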

Of course, we still need to handle the escape codes, and we can use a start condition similar to yours to do so. We only enter this start condition when we find an escape character inside the string literal. [Note 2] To catch this case, we rely on the maximal munch rule implemented by (f)lex: the pattern with the longest match beats any other pattern that matches at the same input point. [Note 3] Since we've already matched any token which starts with a " and does not include a backslash before the closing ", we can add a very similar pattern without the closing quote. It will only match when the first rule doesn't, because the match with the closing quote is one character longer.

\"[^"\\]*     { yylval.str = new std::string(yytext + 1, yyleng - 1);
                BEGIN(S_STRING);
                /* No return, so the scanner will continue in the new state */
              }
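The interaction between the two patterns can be imitated outside of flex. The following is only an illustrative sketch: longest_match, Match and string_rules are hypothetical names, and std::regex is used as a stand-in for the scanner's pattern matching, not as a description of how flex works internally. It tries each pattern at the start of the input, keeps the longest match, and prefers the earlier pattern on a tie.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Match {
    std::size_t rule;    // index of the winning pattern
    std::size_t length;  // number of characters it matched
};

// Illustrative only: tries every pattern at the start of `input` and keeps the
// longest match; on a tie the earlier pattern wins because only a strictly
// longer match replaces the current best.
Match longest_match(const std::vector<std::regex>& rules, const std::string& input) {
    assert(!rules.empty());
    Match best{rules.size(), 0};  // rule == rules.size() means "no rule matched"
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::smatch m;
        if (std::regex_search(input, m, rules[i],
                              std::regex_constants::match_continuous)) {
            std::size_t len = static_cast<std::size_t>(m.length(0));
            if (best.rule == rules.size() || len > best.length) {
                best = Match{i, len};
            }
        }
    }
    return best;
}

// The two string patterns from the answer, written as ECMAScript regexes.
std::vector<std::regex> string_rules() {
    return {
        std::regex(R"re("[^"\\]*")re"),  // complete string without escapes
        std::regex(R"re("[^"\\]*)re"),   // opening quote and escape-free prefix
    };
}
```

For an input that starts with "abc" the first pattern wins by one character; for an input that starts with "ab followed by a backslash, only the second pattern matches, which is the case that switches to the start condition.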

In the S_STRING state, we can still match sequences (not just single characters) which don't contain a backslash, thereby significantly reducing the number of action executions and string appends:

(Braced pattern lists in a start condition are a flex extension.)

<S_STRING>{
  [^"\\]+       { yylval.str->append(yytext, yyleng); }
  \\n           { (*yylval.str) += '\n'; }
   /* Etc. Handle other escape sequences similarly */
  \\.           { (*yylval.str) += yytext[1]; }
  \\\n          { /* A backslash at the end of the line. Do nothing */ }
  \"            { BEGIN(INITIAL); return T_STRING; }
     /* See below */
}

When we eventually find an unescaped double-quote, which will match the last pattern, we first reset the lexical state, and then return the string which has been completely constructed.
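The same run-at-a-time idea can be tried without flex. The sketch below is a hypothetical stand-alone decoder (decode_string_body is not part of flex or of the code above); it handles only \n, \t and backslash-newline explicitly, and returning std::nullopt for an unterminated literal is an assumption of this sketch. It appends each run of ordinary characters with one call instead of one call per character.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-alone decoder; `src` is the text after the opening quote.
// Returns the decoded string, or std::nullopt if the literal is unterminated.
std::optional<std::string> decode_string_body(std::string_view src) {
    std::string out;
    std::size_t pos = 0;
    while (pos < src.size()) {
        // Like [^"\\]+ : find the end of the run of ordinary characters.
        std::size_t stop = src.find_first_of("\"\\", pos);
        if (stop == std::string_view::npos) return std::nullopt;  // no closing quote
        out.append(src.data() + pos, stop - pos);  // one append per run
        if (src[stop] == '"') return out;          // unescaped quote ends the literal
        // src[stop] is a backslash: handle the escape sequence.
        if (stop + 1 >= src.size()) return std::nullopt;  // backslash at end of input
        char c = src[stop + 1];
        switch (c) {
            case 'n': out += '\n'; break;
            case 't': out += '\t'; break;
            case '\n': break;          // backslash-newline: line continuation
            default: out += c; break;  // like \\. : keep the escaped character
        }
        pos = stop + 2;
    }
    return std::nullopt;  // ran out of input before the closing quote
}
```

A literal with k escape sequences costs roughly 2k + 1 appends here, rather than one per character.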

The pattern \\\n actually matches a backslash at the very end of the line. It's common to completely ignore this backslash and the newline, in order to allow long strings to be continued over several source lines. If you don't want to provide this feature, just change the \\. pattern to \\(.|\n).

And what if we don't find an unescaped double-quote? That is, what if the closing double quote was accidentally omitted? We will end up in the S_STRING start condition in this case, since the string was not terminated by a quote, and so the fallback pattern will match. In the S_STRING patterns, we need to add two more possibilities:

<S_STRING>{
    // ... As above
  <<EOF>>      |
  \\           { /* Signal a lexical error */ }
}

The first of these rules catches the simple unterminated string error. The second one catches the case in which a backslash was not followed by a legitimate character, which, given the other rules, can only happen if a backslash is the very last character in a program with an unterminated string. Unlikely though that is, it can happen, so we should catch it.

One further optimization is relatively simple, although I wouldn't recommend it because it mostly just complicates the code, and the benefit is infinitesimal. (For this very reason, I haven't included any sample code.)

In the start condition, a backslash (almost) always results in appending a single character to the string we're accumulating, which means that we might resize the string in order to do this append, even though we just resized it to append the non-escaped characters. Instead, we could add one additional character to the string in the action which matches the non-escape characters. (Because (f)lex modifies the input buffer to NUL-terminate the token, the character following the token will always be a NUL, so increasing the length of the append by one will insert this NUL and not the backslash into the string. But that's not important.)

Then the code which handles the escape character needs to replace the last character in the string rather than appending a single character to the string, thereby avoiding one append call. Of course, in the cases where we don't want to insert anything, we'll need to reduce the size of the string by one character, and if there is an escape sequence (such as a Unicode escape) which adds more than one byte to the string, we'll need to do some other acrobatics.

In short, I'd qualify this as a hack more than an optimization. But for what it's worth, I have done things like this in the past, so I have to plead guilty to the charge of premature optimization, too.

Notes

1. Your code only prints out the token, which makes it hard to know what your design is for passing the string to the parser. I'm assuming here a more or less standard strategy in which the semantic value yylval is a union, one of whose members is a std::string* (not a std::string). I don't address the resulting memory management issues, but a %destructor declaration will help a lot.
2. In the original version of this answer, I suggested catching this case by using a pattern which matches a backslash as trailing context:
```
\"[^"\\]*/\\  { yylval.str = new std::string(yytext + 1, yyleng - 1);
                BEGIN(S_STRING);
                /* No return, so the scanner will continue in the new state */
              }
```
   But using the maximal munch rule is simpler and more general.
3. If more than one pattern has the same longest match, the first one in the scanner description wins.

Answer 1: score 2, accepted, 2017-02-21 04:00:11
