
Parsing C files without preprocessing them

I want to run simple analyses on C files (for example: if you call the foo macro with INT_TYPE as an argument, then cast the result to int*). I do not want to preprocess the file; I just want to parse it (so that, for instance, I'll have correct line numbers).

That is, I want to get from

#include <a.h>

#define FOO(f)

int f() {FOO(1);}

a list of tokens like

<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
    <return>int</return>
    <body>
        <macro_call name="FOO"><param>1</param></macro_call>
    </body>
</function>

with no need to set include paths, etc.

Is there any preexisting parser that does this? All the parsers I know assume the C has already been preprocessed. I want access to the macros and to the actual include directives.

Our C Front End can do this to a fair extent: it parses code containing preprocessor elements and still builds a usable AST. (Yes, the parse tree has precise file/line/column information.)

It imposes a number of restrictions, and those restrictions are what allow it to handle most code. In the few cases it cannot handle, a small, easy change to the source file that yields equivalent code often solves the problem.

Here's a rough set of rules and restrictions:

  • #includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
  • Macro calls can occur where function calls occur in expressions, or can appear without a semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). These do occur from time to time and need manual revision. OP's example of "j(v,oid)*" is problematic, but this is really rare in code.
  • #if ... #endif must be wrapped around major language concepts (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) { . Each arm of the conditional must contain the same kind of syntactic construct as the other arms. #if wrapped around random text, used as a bad kind of comment, is problematic, but easily fixed in the source by making it a real comment. Where these conditions are not met, you need to modify the original source code, often by moving the #if, #elif, #else, or #endif a few tokens.
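To illustrate the last rule, here is a minimal sketch of the kind of source change it describes. The names WIDE, total, and total_t are hypothetical, and whether a particular tool rejects the "before" form is an assumption; the point is only that each arm of the conditional goes from holding part of a function header to holding a complete declaration.

```c
/*
 * Before: each arm holds only part of a function header, so neither arm is
 * a complete declaration, statement, or expression on its own.
 *
 *   #ifdef WIDE
 *   long total(long a,
 *   #else
 *   int total(int a,
 *   #endif
 *              int b) { return a + b; }
 */

/* After: both arms are complete declarations of the same kind (a typedef). */
#ifdef WIDE
typedef long total_t;
#else
typedef int total_t;
#endif

/* The function header is now written once, outside the conditional. */
total_t total(total_t a, int b)
{
    return a + b;
}
```

Both versions compile to the same function for either setting of WIDE; only the placement of the conditional changes.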

In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is not being able to parse the source code at all, which is far worse than annoying.

You also want more than just a parser. See Life After Parsing to learn what happens after you succeed in getting a parse tree. We've done some additional work on building symbol tables in which each declaration is recorded together with the preprocessor context in which it is embedded, so that type checking can take the preprocessor conditions into account.
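The answer does not say how that preprocessor context is represented. One way to picture it, as a rough sketch only, is to compute for each declaration's line the conjunction of the conditionals that enclose it and store that text next to the symbol. The function name condition_at_line and the limits MAX_DEPTH and COND_LEN are hypothetical; the sketch handles #if, #ifdef, #ifndef, #else, and #endif, leaves #elif out, and assumes cap is at least 1.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 16   /* deepest conditional nesting this sketch tracks */
#define COND_LEN  128  /* longest condition text kept per nesting level */

/*
 * Write the conjunction of preprocessor conditions guarding `target_line`
 * (1-based) into `out`. Returns the nesting depth at that line, or -1 if
 * the line is never reached or nesting exceeds MAX_DEPTH. #elif is ignored.
 */
int condition_at_line(const char *src, int target_line, char *out, size_t cap)
{
    char stack[MAX_DEPTH][COND_LEN];
    int depth = 0, line = 1;
    const char *p = src;

    while (*p != '\0') {
        size_t len = strcspn(p, "\n");
        char text[COND_LEN], word[16], arg[100], prev[COND_LEN];

        if (line == target_line) {
            out[0] = '\0';
            for (int i = 0; i < depth; i++) {
                size_t used = strlen(out);
                snprintf(out + used, cap - used, "%s%s",
                         i ? " && " : "", stack[i]);
            }
            return depth;
        }

        /* Copy the line and split it into directive name and argument. */
        snprintf(text, sizeof text, "%.*s", (int)len, p);
        arg[0] = '\0';
        if (sscanf(text, " # %15s %99[^\n]", word, arg) >= 1) {
            int opens = strcmp(word, "if") == 0 ||
                        strcmp(word, "ifdef") == 0 ||
                        strcmp(word, "ifndef") == 0;
            if (opens && depth == MAX_DEPTH)
                return -1;
            if (strcmp(word, "ifdef") == 0) {
                snprintf(stack[depth++], COND_LEN, "defined(%s)", arg);
            } else if (strcmp(word, "ifndef") == 0) {
                snprintf(stack[depth++], COND_LEN, "!defined(%s)", arg);
            } else if (strcmp(word, "if") == 0) {
                snprintf(stack[depth++], COND_LEN, "(%s)", arg);
            } else if (strcmp(word, "else") == 0 && depth > 0) {
                /* The #else arm is guarded by the negated condition. */
                snprintf(prev, sizeof prev, "%s", stack[depth - 1]);
                snprintf(stack[depth - 1], COND_LEN, "!(%s)", prev);
            } else if (strcmp(word, "endif") == 0 && depth > 0) {
                depth--;
            }
        }

        p += len;
        if (*p == '\n')
            p++;
        line++;
    }
    return -1;
}
```

A symbol-table entry for a typedef declared inside the #else arm of an #ifdef WIDE block would then carry the condition text "!(defined(WIDE))", which a later type-checking pass could consult.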

You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.

Your specific example can be handled by writing your own parser and ignoring macro expansion, because FOO(1) itself can be parsed as a function call.
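A minimal sketch of that idea, under stated assumptions: it scans the file as written, records #include and #define directives, and reports every identifier followed by "(" as call-like, each with its original line number. All names (scan_unpreprocessed, struct item, the ITEM_ constants) are hypothetical. It is a scanner rather than a parser: it does not skip comments or string literals, and it does not tell a function definition from a call or a macro from a function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Item kinds reported by this sketch; all names here are illustrative. */
enum item_kind { ITEM_INCLUDE, ITEM_DEFINE, ITEM_CALL_LIKE };

struct item {
    enum item_kind kind;
    int line;       /* 1-based line in the unpreprocessed file */
    char name[64];  /* header name, macro name, or callee name */
};

static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Copy the identifier at s into out (truncating); return its full length. */
static size_t read_ident(const char *s, char *out, size_t cap)
{
    size_t n = 0;
    while (is_ident_char(s[n])) {
        if (n + 1 < cap)
            out[n] = s[n];
        n++;
    }
    out[n < cap ? n : cap - 1] = '\0';
    return n;
}

static int is_keyword(const char *w)
{
    static const char *const kw[] = { "if", "while", "for", "switch",
                                      "return", "sizeof" };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(w, kw[i]) == 0)
            return 1;
    return 0;
}

/* Scan src as written, with no preprocessing; return the number of items. */
size_t scan_unpreprocessed(const char *src, struct item *out, size_t max)
{
    size_t count = 0;
    int line = 1, at_line_start = 1;
    const char *p = src;

    assert(src != NULL && out != NULL);
    while (*p != '\0' && count < max) {
        struct item *it = &out[count];
        if (*p == '\n') { line++; at_line_start = 1; p++; continue; }
        if (*p == ' ' || *p == '\t') { p++; continue; }
        if (*p == '#' && at_line_start) {
            char word[16];
            p++;
            p += strspn(p, " \t");
            p += read_ident(p, word, sizeof word);
            p += strspn(p, " \t");
            it->line = line;
            if (strcmp(word, "include") == 0 && (*p == '<' || *p == '"')) {
                char close = (*p == '<') ? '>' : '"';
                size_t n = 0;
                for (p++; *p != '\0' && *p != '\n' && *p != close; p++)
                    if (n + 1 < sizeof it->name)
                        it->name[n++] = *p;
                it->name[n] = '\0';
                it->kind = ITEM_INCLUDE;
                count++;
            } else if (strcmp(word, "define") == 0 &&
                       read_ident(p, it->name, sizeof it->name) > 0) {
                it->kind = ITEM_DEFINE;
                count++;
            }
            /* Skip the rest of the directive, following "\" continuations. */
            while (*p != '\0' && *p != '\n') {
                if (p[0] == '\\' && p[1] == '\n') { line++; p += 2; }
                else p++;
            }
            continue;
        }
        at_line_start = 0;
        if (isdigit((unsigned char)*p)) {  /* skip numbers such as 1e5 */
            while (is_ident_char(*p) || *p == '.')
                p++;
            continue;
        }
        if (is_ident_char(*p)) {
            p += read_ident(p, it->name, sizeof it->name);
            if (p[strspn(p, " \t")] == '(' && !is_keyword(it->name)) {
                it->kind = ITEM_CALL_LIKE;
                it->line = line;
                count++;
            }
            continue;
        }
        p++;
    }
    return count;
}
```

Run on the question's five-line example, this reports the include of a.h at line 1, the definition of FOO at line 3, and both f and FOO as call-like at line 5. Separating the definition of f from the call to FOO is where a real grammar becomes necessary.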

When more cases are considered, however, the parser becomes much more difficult to write. See the linked PDF for more information.
