How to do static code logical analysis with an AST or another tool?

void f1(char *s)
{
    s[20] = 0;
}
void f2()
{
    char a[10];
    if (x + y == 2) {
        f1(a);
    }
}

Cppcheck will report this message: Array 'a[10]' index 20 out of bounds

How could Cppcheck get the connection between 'a' in f2 and 's' in f1?

I have built an AST, but it only supplies information about each symbol and tells me little about the logical relationships between symbols. How could a computer know that 'a' in f2 and 's' in f1 are the same thing? As far as I know, many situations have to be taken into consideration, such as:

void f1(char *s)
{
    char str_arry[30];
    s = str_arry;
    s[20] = 0;
}

In this case 's' and 'a' are not the same thing.

I don't know exactly how Cppcheck works, but I'll tell you how this problem is solved in general. There are two main approaches to the analysis of interrelated functions.

In the first approach, when the analyzer encounters a function call, it starts analyzing the callee's body, taking into account the values of the actual arguments passed to the function. Naturally, this only happens if it is known which values are passed to the function. This could be an exact value, a range, a set of values, a null/non-null pointer, etc. The complexity of the information passed depends on the sophistication of the analyzer. For example, it can start analyzing the function body knowing that two of the passed pointers refer to the same array.
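As a rough illustration of this first approach, here is a minimal C++ sketch. It assumes a toy model in which a function body has already been reduced to the constant indices it writes through its single pointer parameter. The names FunctionBody, ArgumentFact and analyze_call are hypothetical and are not taken from Cppcheck or any other real analyzer.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy model (hypothetical): a function body reduced to the constant
// indices it writes through its single pointer parameter.
struct FunctionBody {
    std::string name;
    std::vector<std::size_t> written_indices;
};

// What the analyzer knows about the actual argument at one call site.
// An empty size means "unknown", in which case nothing can be proven.
struct ArgumentFact {
    std::string array_name;
    std::optional<std::size_t> array_size;
};

// Re-analyze the callee body for this particular call site and collect
// one diagnostic per write that falls outside the known array bounds.
std::vector<std::string> analyze_call(const FunctionBody& callee,
                                      const ArgumentFact& arg) {
    std::vector<std::string> diagnostics;
    if (!arg.array_size) {
        return diagnostics;  // no information about the argument
    }
    for (std::size_t index : callee.written_indices) {
        if (index >= *arg.array_size) {
            diagnostics.push_back(
                "Array '" + arg.array_name + "[" +
                std::to_string(*arg.array_size) + "]' index " +
                std::to_string(index) + " out of bounds");
        }
    }
    return diagnostics;
}
```

Describing f1 as FunctionBody{"f1", {20}} and the argument as ArgumentFact{"a", 10} yields one diagnostic, while the same body called with a 30-element array yields none. The body is walked again for every call site, which is where the cost of this approach comes from.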

It's an excellent, accurate approach, but there's a serious problem: analyzers based on this concept are very slow. They have to analyze function bodies with different input data sets over and over again. The functions in turn call other ones, and so on. At some point the "inside" analysis has to be stopped, which, in practice, makes this approach less accurate and excellent than it might seem in theory.

There's a second approach, based on automatic function annotations. When a function is analyzed, information is gathered on how its arguments are used and which values they can't take. Let's consider the simple example that I gave in the article 'Technologies used in the PVS-Studio code analyzer for finding bugs and potential vulnerabilities'.

int Div(int X)
{
  return 10 / X;
}
void Foo()
{
  for (int i = 0; i < 5; ++i)
    Div(i);
}

The analyzer recognizes that the variable X is used in the Div function as a divisor. Based on this, a special annotation for the Div function is created automatically. The analyzer then takes into account the fact that the range of values [0..4] is passed to the function as the X argument, and concludes that a division by zero occurs.
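The annotation idea can be sketched in C++ as follows. This is only an illustration under simplifying assumptions: the annotation records a single property per parameter (whether it is used as a divisor), and call-site knowledge is an inclusive integer range. ParamAnnotation, FunctionAnnotation, ValueRange and check_call are hypothetical names and do not reflect how PVS-Studio stores its annotations.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of one parameter, computed once per function.
struct ParamAnnotation {
    bool used_as_divisor = false;
};

struct FunctionAnnotation {
    std::string name;
    std::vector<ParamAnnotation> params;
};

// Inclusive range of values an argument may take at a call site.
struct ValueRange {
    int low;
    int high;
    bool contains(int v) const { return low <= v && v <= high; }
};

// Check one call against the annotation only; the callee body is not
// analyzed again, which is what keeps this approach fast.
std::vector<std::string> check_call(const FunctionAnnotation& callee,
                                    const std::vector<ValueRange>& args) {
    std::vector<std::string> diagnostics;
    for (std::size_t i = 0; i < callee.params.size() && i < args.size(); ++i) {
        if (callee.params[i].used_as_divisor && args[i].contains(0)) {
            diagnostics.push_back("possible division by zero in '" +
                                  callee.name + "', argument " +
                                  std::to_string(i + 1));
        }
    }
    return diagnostics;
}
```

For the Div example, the annotation would mark the first parameter as a divisor, and the call inside the loop would be checked with the range [0..4], which contains zero and therefore produces a diagnostic. A call with the range [1..4] would pass.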

This approach is cruder and less accurate than the first one, but it is very fast and makes it possible to build strong connections between a large number of functions without loss of performance.

In practice it can be much more complicated. For example, the PVS-Studio analyzer uses the second approach as its main one, but not always. Sometimes, when dealing with template functions, we analyze them once more (the first approach). In other words, we use a combined approach to maintain a balance between the depth and the speed of analysis.

How could Cppcheck get the connection between 'a' in f2 and 's' in f1?

They are definitely not the same. One of the following can happen:


You pass a to the function, and Cppcheck continues to remember the size of a, even though you access it through the formal parameter s.

You have to keep in mind that static analysis tools and compilers work differently, with different purposes in mind. Static analysis tools were created EXACTLY for the purpose of catching things like the one you presented in your question.


In your second example you have:

s= str_arry;

which removes the connection between s and a.

To analyze the possible sources of a value, it's a good idea to turn all variables into immutables: introduce a new symbol whenever the original is changed, and use the new symbol for all following occurrences (the original symbol won't be used after the point where it was reassigned in the original code). This is essentially what compilers call static single assignment (SSA) form.

Consider the following code:

// control flow block 1
int i = 1;
if (some_condition()) {
    // control flow block 2
    i = 2;
}
// control flow block 3
int j = i;

With the control flow graph

[1]
 | \     <- if (some_condition())
 |  [2]
 | /     <- join of control flow after the if block ends
[3]

You could write a list of all symbols that are alive (that have a value used anywhere later in the control flow graph) at the entry and exit points of each block in the control flow graph:

[1] entry: nothing; exit: i
[2] entry: nothing; exit: i
[3] entry: i; exit: i, j (I assume i, j are re-used after the end of this example)

Notice that the entry list of [2] is empty, since i is never read and always written within block [2]. The problem with this representation is that i is in the exit list of all blocks, but it has different possible values for each block.
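The entry and exit lists above can be computed mechanically with a backward data-flow pass. The following C++ sketch assumes each block has already been summarized by the symbols it reads before writing (use) and the symbols it writes (def). Block, compute_liveness and example_cfg are hypothetical names, and blocks are indexed from 0, so block [1] of the example is index 0.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using SymbolSet = std::set<std::string>;

struct Block {
    SymbolSet use;   // symbols read before any write in the block
    SymbolSet def;   // symbols written in the block
    std::vector<std::size_t> successors;
    SymbolSet live_in;   // alive at block entry
    SymbolSet live_out;  // alive at block exit
};

// Backward data-flow analysis, iterated until nothing changes:
//   live_out(b) = union of live_in(s) over all successors s of b
//   live_in(b)  = use(b) union (live_out(b) minus def(b))
// Blocks without successors use live_at_exit as their live_out.
void compute_liveness(std::vector<Block>& blocks,
                      const SymbolSet& live_at_exit) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = blocks.size(); i-- > 0;) {
            Block& b = blocks[i];
            SymbolSet out = b.successors.empty() ? live_at_exit : SymbolSet{};
            for (std::size_t s : b.successors) {
                out.insert(blocks[s].live_in.begin(), blocks[s].live_in.end());
            }
            SymbolSet in = b.use;
            for (const std::string& sym : out) {
                if (b.def.count(sym) == 0) in.insert(sym);
            }
            if (in != b.live_in || out != b.live_out) {
                b.live_in = std::move(in);
                b.live_out = std::move(out);
                changed = true;
            }
        }
    }
}

// The three blocks of the example above (index 0 is block [1]).
std::vector<Block> example_cfg() {
    std::vector<Block> blocks(3);
    blocks[0].def = {"i"};
    blocks[0].successors = {1, 2};
    blocks[1].def = {"i"};
    blocks[1].successors = {2};
    blocks[2].use = {"i"};
    blocks[2].def = {"j"};
    return blocks;
}
```

Calling compute_liveness on example_cfg() with i and j assumed alive after the example reproduces the lists above: nothing is alive at the entry of blocks [1] and [2], and i is alive at the exit of both.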

So, let's introduce the immutable symbols in pseudo-code:

// control flow block 1
i = 1;
if (some_condition()) {
    // control flow block 2
    i_1 = 2;
}
// control flow block 3
// join-logic of predecessor [1] and [2]
i_2 = one_of(i, i_1);
j = i_2;

Now every variable is coupled to exactly one assignment, its first and only one. This means a dependency graph can be constructed by analyzing the symbols involved in each assignment:

i   -> i_2
i_1 -> i_2
i_2 -> j

Now, if there is any constraint on the allowed values of j, a static checker could require that all predecessors of j (namely i_2, which in turn originates from i and i_1) satisfy this requirement.
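Finding all predecessors of a symbol is a plain graph traversal. Here is a minimal C++ sketch, assuming the edges have already been extracted from the assignments. DependencyGraph, add_edge and origins are hypothetical names chosen for this illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Edges point from a source symbol to the symbol assigned from it,
// for example add_edge("i", "i_2") for i_2 = one_of(i, i_1).
class DependencyGraph {
public:
    void add_edge(const std::string& from, const std::string& to) {
        predecessors_[to].insert(from);
    }

    // All symbols that can flow into `symbol`, directly or transitively.
    std::set<std::string> origins(const std::string& symbol) const {
        std::set<std::string> seen;
        std::vector<std::string> worklist{symbol};
        while (!worklist.empty()) {
            std::string current = worklist.back();
            worklist.pop_back();
            auto it = predecessors_.find(current);
            if (it == predecessors_.end()) continue;
            for (const std::string& pred : it->second) {
                // Visit each predecessor once, so cycles terminate.
                if (seen.insert(pred).second) worklist.push_back(pred);
            }
        }
        return seen;
    }

private:
    std::map<std::string, std::set<std::string>> predecessors_;
};
```

With the three edges listed above, origins("j") returns i, i_1 and i_2, which is the set of symbols a checker would have to examine for a constraint on j.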

In the case of function calls, the dependency graph would contain an edge from every call argument to the corresponding parameter in the function definition.

Applying this to your example is straightforward if we focus only on the array variable and ignore changes to the array content (I'm not sure to what extent a static checker would track the content of individual array items in order to find danger down the road):

Example 1:

void f1(char *s)
{
    s[20] = 0;
}

void f2()
{
    char a[10];
    if (x + y == 2) {
        f1(a);
    }
}

Transforms to

f1(s)
{
    s[20] = 0;
}

f2()
{
    a = char[10];
    if (x + y == 2) {
        call f1(a);
    }
}

With the dependency graph including the arguments passed via the function call

a -> s

So it's immediately clear that a has to be considered for the static analysis of the safety of s[20].

Example 2:

void f1(char *s)
{
    char str_arry[30];
    s= str_arry;
    s[20] = 0;
}

Transforms to

f1(s)
{
    // control flow block 1
    str_arry = char[30];
    s_1 = str_arry;
    s_1[20] = 0;
}

With the dependency graph

str_arry -> s_1

So it's immediately clear that the only value to be considered for the static analysis of the safety of s_1[20] is str_arry.
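Putting the two examples together, a bounds check over the dependency graph might look like the following C++ sketch. It assumes the transformed code has already been reduced to two tables: the declared sizes of arrays and the dependency edges between symbols. Analysis, arrays_behind and out_of_bounds are hypothetical names, and this is not a description of Cppcheck's actual implementation.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Analysis {
    // Declared sizes, e.g. array_sizes["a"] = 10 for a = char[10].
    std::map<std::string, std::size_t> array_sizes;
    // Dependency edges stored backwards: the edge a -> s is kept as
    // sources["s"] = {"a"}.
    std::map<std::string, std::set<std::string>> sources;

    // Arrays that `symbol` may refer to, following edges backwards.
    std::set<std::string> arrays_behind(const std::string& symbol) const {
        std::set<std::string> arrays;
        std::set<std::string> seen{symbol};
        std::vector<std::string> worklist{symbol};
        while (!worklist.empty()) {
            std::string current = worklist.back();
            worklist.pop_back();
            if (array_sizes.count(current) != 0) arrays.insert(current);
            auto it = sources.find(current);
            if (it == sources.end()) continue;
            for (const std::string& src : it->second) {
                if (seen.insert(src).second) worklist.push_back(src);
            }
        }
        return arrays;
    }

    // Names of the arrays that are too small for the access symbol[index].
    std::vector<std::string> out_of_bounds(const std::string& symbol,
                                           std::size_t index) const {
        std::vector<std::string> offenders;
        for (const std::string& array : arrays_behind(symbol)) {
            if (index >= array_sizes.at(array)) offenders.push_back(array);
        }
        return offenders;
    }
};
```

For Example 1, with a of size 10 and the edge a -> s, out_of_bounds("s", 20) reports a. For Example 2, with str_arry of size 30 and the edge str_arry -> s_1, out_of_bounds("s_1", 20) reports nothing, because the reassignment introduced a new symbol that no longer depends on the caller's argument.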
