
Unix Flex Regex for Multi-Line Comments

I am making a lexical analyzer using Flex on Unix. If you've ever used it before, you know that you mainly just define the regex for the tokens of whatever language you are writing the lexical analyzer for. I am stuck on the final part. I need the correct regex for multi-line comments that allows something like

/* This is a comment \*/

but also allows

/* This **** //// is another type of comment */

Can anyone help with this?

You don't match C-style comments with a simple regular expression in Flex; they require a more complex matching method based on start states. The Flex FAQ says how (well, it does for the /*...*/ form; handling the other form in just the <INITIAL> state should be simple).

If you're required to make do with just regex, however, there is indeed a not-too-complex solution:

"/*"( [^*] | (\*+[^*/]) )*\*+\/

The spaces are only there for readability. Remove them in an actual Flex rule, since unquoted whitespace ends the pattern.

The full explanation and derivation of that regex is elaborated upon here.
In short:
  • "/*" marks the start of the comment.
  • ( [^*] | (\*+[^*/]) )* says: accept any character that is not * (the [^*]), or accept a run of one or more * as long as the run is followed by something other than * or / (the (\*+[^*/])). This means every ******... run is accepted except *****/, since there is no run of * there that is followed by something other than * or /.
  • The *******/ case is then handled by the last part of the regex, which matches one or more * followed by a / to mark the end of the comment, i.e. \*+\/
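To see the three parts of that regex at work outside Flex, here is a minimal sketch in plain C of a hand-written matcher that accepts the same strings. The function name match_comment is made up for this illustration and is not part of Flex; it assumes a NUL-terminated string that does not contain embedded NUL bytes.

```c
#include <stddef.h>

// Hypothetical helper, not a Flex API.
// Returns the length of the comment that starts at s[0], or 0 if s does
// not begin with a complete, terminated comment.
size_t match_comment(const char *s)
{
    // Part 1: the opening delimiter, slash followed by star.
    if (s[0] != '/' || s[1] != '*')
        return 0;

    int in_stars = 0;   // 1 while the most recent characters form a run of '*'
    for (size_t i = 2; s[i] != '\0'; i++) {
        if (s[i] == '*') {
            in_stars = 1;          // inside a run of stars
        } else if (s[i] == '/' && in_stars) {
            return i + 1;          // Part 3: stars followed by '/', end of comment
        } else {
            in_stars = 0;          // Part 2: an ordinary character, or the
                                   // non-star, non-slash that ends a star run
        }
    }
    return 0;                      // ran out of input: unterminated comment
}
```

Called on "/* a */ b */" this returns 7, the length of the first complete comment, because the body can never contain a star immediately followed by a slash. Called on "/*/" it returns 0, since the closing part needs at least one star of its own after the opening delimiter.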

The ANSI C Lex specification at http://www.lysator.liu.se/c/ANSI-C-grammar-l.html handles comments with a hand-written input loop instead of a regex:

    "/*"            { comment(); }
    
    comment() {
        char c, c1;
    
    loop:
        while ((c = input()) != '*' && c != 0)
            putchar(c);
    
        if ((c1 = input()) != '/' && c != 0) {
            unput(c1);
            goto loop;
        }
    
        if (c != 0)
            putchar(c1);
    }
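For comparison, here is a rough standalone sketch of the same idea using only stdio, so it can be tried without generating a scanner. It is not the code from that grammar: skip_comment is a hypothetical name, it reads from a FILE * with getc instead of Flex's input(), it does not echo the comment text, and it keeps the character in an int so that EOF can be told apart from real characters.

```c
#include <stdio.h>

// Hypothetical helper for illustration only.
// Call it after the opening delimiter of the comment has been consumed.
// Returns 0 once the closing star-slash has been read, -1 if the input
// ends before the comment is closed.
int skip_comment(FILE *in)
{
    int c;              // int rather than char, so EOF is distinguishable
    int prev_star = 0;  // 1 if the previous character was '*'

    while ((c = getc(in)) != EOF) {
        if (c == '/' && prev_star)
            return 0;               // star followed by slash: comment ends
        prev_star = (c == '*');     // remember whether we just saw a star
    }
    return -1;                      // unterminated comment
}
```

Tracking only whether the previous character was a star avoids the need to push a character back, and a run such as ***/ still terminates the comment because the flag stays set across consecutive stars.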
    

A question whose answer would also solve this: How do I write a non-greedy match in LEX / FLEX?

I don't know Flex, but I do know regexes. /\/\*.*?\*\//s should match both types (in PCRE), but if you need to differentiate them in your analyser, you may then want to iterate over the list of matches to see whether they're the second type with /\*\*\s+\/{4}/
