
How to process C and C++ source code to calculate metrics for static code analysis?

Original post: 2019-03-21 13:01:54 5 1 c++/ parsing/ antlr4/ metrics/ llvm-clang

I am extending a software tool that calculates metrics for software projects. The metrics are then used for static code analysis. My task is to implement the metric calculation for C and C++ projects.

During development I ran into problems that forced me to start over with a different tool or programming language. I will describe the process, the problems, and what I tried to solve them, in chronological order and as clearly as I can.

Some metrics:

Lines of code for classes, structs, unions, functions/methods and source files
Method count for classes and structs
Complexity for classes, structs and functions/methods
Dependencies for/between classes and structs

Since C++ is a difficult language to parse and writing a C++ parser on my own is out of scope, I want to use an existing C++ parser. I therefore began using libraries from the LLVM project to gather syntactic and semantic information about a source file.

LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html

First I started with LibTooling, written in C++, since it promised "full control" over the Abstract Syntax Tree (AST). I tried the RecursiveASTVisitor and MatchFinder approaches without success.

So I dismissed LibTooling because I couldn't retrieve context information about the surroundings of a node in the AST. I was only able to react in a callback when a specific node in the AST was visited, but I didn't know what context I was currently in. E.g., when I visit a C++ record declaration (class, struct, union), I don't know whether it is a nested record or not. But that information is needed to calculate the lines of code for a single class.
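To make the missing context concrete, here is a minimal sketch of a recursive walk that keeps the enclosing records on a stack. The Node type is a hypothetical stand-in for an AST node and is not a Clang type or API; the rule that a nested record's lines are subtracted from its parent is also only an assumption about how the metric might be defined.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for an AST node (not a Clang type)."""
    kind: str          # e.g. "record", "function", "translation_unit"
    name: str
    start_line: int
    end_line: int
    children: List["Node"] = field(default_factory=list)


def record_loc(root: Node) -> Dict[str, int]:
    """Lines per record, with the lines of nested records excluded."""
    loc: Dict[str, int] = {}
    stack: List[str] = []  # qualified names of the records enclosing the current node

    def visit(node: Node) -> None:
        is_record = node.kind == "record"
        if is_record:
            span = node.end_line - node.start_line + 1
            qualified = f"{stack[-1]}::{node.name}" if stack else node.name
            loc[qualified] = span
            if stack:
                # Nested record: its lines do not count for the enclosing record.
                loc[stack[-1]] -= span
            stack.append(qualified)
        for child in node.children:
            visit(child)
        if is_record:
            stack.pop()

    visit(root)
    return loc
```

With a record Outer spanning lines 1 to 10 that contains a record Inner spanning lines 3 to 6, the sketch reports 6 lines for Outer and 4 for Outer::Inner.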

The second approach was using the LibClang interface via the Python bindings. With LibClang I was able to traverse the AST recursively, node by node, and store the needed context information on a stack. Here I encountered a general problem with LibClang:

Before the AST for a file is created, the preprocessor runs and resolves all preprocessor directives, just as it is supposed to.

This is good because the output AST will be incomplete if the preprocessor can't resolve all the include directives.
This is very bad because I won't be able to provide all the include files or directories for an arbitrary C++ project.
This is bad because code surrounded by conditional preprocessor directives is or is not part of the AST depending on whether a preprocessor variable is defined. Parsing the same file multiple times with different sets of defined or undefined preprocessor variables is out of scope.

This led to the third and current attempt: using a C++ parser generated by ANTLR from a C++14 grammar.

No preprocessor runs before the parser. This is good because the full source code is parsed and preprocessor directives are ignored. The bad thing is that the parser does not seem to be very robust: it fails on code that compiles, which leads to a broken AST. So this solution is not sufficient either.

My questions are:

Is there an option to deactivate the preprocessor before parsing a C/C++ source or header file with libClang, so that the source code is untouched and the AST is complete and detailed?
Is there a way to parse a C/C++ source code file without providing all the necessary include directories but still get a detailed AST?
Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?

If you think this is not the right place to ask such questions, feel free to redirect me to another place.

1 solution

To answer your last question,

"Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?"

Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source and to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus and Denys Poshyvanyk. With this approach you can still collect information such as LOC and number of methods, without needing a full parser.
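As a rough illustration of treating the source as plain text, the sketch below counts lines and function-like definitions with a regular expression. The names count_lines, count_function_like, CONTROL_KEYWORDS and FUNCTION_LIKE are made up for this example, and the pattern is a deliberately crude heuristic: it assumes no nested parentheses in the parameter list and does not know about specifiers such as noexcept or override, so the count is only an approximation.

```python
import re

# Keywords that are followed by "(...) {" but do not start a function body.
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch"}

# Heuristic: identifier, parameter list without nested parentheses,
# an optional `const`, then an opening brace.
FUNCTION_LIKE = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*(?:const\s*)?\{")


def count_lines(text):
    """Count physical and non-blank lines of a source text."""
    lines = text.splitlines()
    non_blank = sum(1 for line in lines if line.strip())
    return {"physical": len(lines), "non_blank": non_blank}


def count_function_like(text):
    """Count matches that look like function or method definitions."""
    return sum(
        1
        for match in FUNCTION_LIKE.finditer(text)
        if match.group(1) not in CONTROL_KEYWORDS
    )
```

Because nothing is preprocessed, code inside conditional directives is counted regardless of which preprocessor variables would be defined in a real build.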

This approach has drawbacks (as does any approach):

It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source. But the latter is an easy step. The reason the former might be OK is that even the comments contain information about the code, which may help determine which modules are more closely coupled, etc.
It will lump local variables, method names, parameter names, etc. all into the "bag of words" that you are working with.
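Both drawbacks can be seen in a small sketch. The helper names strip_comments and bag_of_words are hypothetical, and the comment pattern is an assumption that covers ordinary // and /* */ comments plus string and character literals; raw string literals and line continuations inside // comments are not handled.

```python
import re
from collections import Counter

# Comments are matched together with string and character literals so that
# comment markers inside a literal are left alone.
COMMENT_OR_LITERAL = re.compile(
    r'//[^\n]*'               # line comment
    r'|/\*.*?\*/'             # block comment
    r'|"(?:\\.|[^"\\])*"'     # string literal (kept)
    r"|'(?:\\.|[^'\\])*'",    # character literal (kept)
    re.DOTALL,
)

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def strip_comments(text):
    """Remove comments but keep literals and the original line breaks."""
    def replace(match):
        token = match.group(0)
        if token.startswith("//"):
            return ""
        if token.startswith("/*"):
            # A space keeps neighbouring tokens apart; newlines keep line counts stable.
            return " " + "\n" * token.count("\n")
        return token  # string or character literal

    return COMMENT_OR_LITERAL.sub(replace, text)


def bag_of_words(text, keep_comments=False):
    """Count every identifier-like word, keywords and literal contents included."""
    source = text if keep_comments else strip_comments(text)
    return Counter(IDENTIFIER.findall(source))
```

The resulting Counter mixes keywords, local variables, method names and parameter names without distinction, which is exactly the lumping described above; passing keep_comments=True adds the words from comments as well.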