Searching for partial substring matches in a string in C#

Question

Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.

For example:

        {
            System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
        }

        //Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b

Gets changed to:

        {
            System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
        }

        //Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
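
For illustration, the hex strings above can be reproduced by ASCII-encoding the source line. This is a minimal sketch: HexSignature and ToHex are hypothetical names, and ASCII encoding is an assumption that happens to match the values shown.

```csharp
using System;
using System.Text;

static class HexSignature
{
    // Encodes the text as ASCII bytes and renders each byte as two lowercase hex digits.
    // Assumption: the signature is taken over the ASCII bytes of the source text.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}
```

Under that assumption, HexSignature.ToHex("test.txt") gives 746573742e747874 and HexSignature.ToHex("notatest.txt") gives 6e6f7461746573742e747874, which are the only parts that differ between the two signatures.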

Keep in mind these bits will be somewhere within the entire hex of the program. How could I go about taking my base signature and looking for partial matches, so that anything with, say, a 90% match gets flagged?

I would use a wildcard, but that wouldn't work for slightly more complex things where the code might be written slightly differently but the majority would be the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.

Thanks in advance for any input.

Answer 1

Using an edit distance would be fine. You can take two strings and calculate the edit distance, which is an integer denoting how many operations are needed to turn one string into the other. You set your own threshold based on that number.

For example, you may statically decide that if the distance is less than five edits, the change is relevant.

You could also take the length of the string you are comparing and use a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
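
As a rough illustration of the idea, here is a minimal sketch of a Levenshtein distance plus a percentage-based threshold. The names EditDistance, Levenshtein and IsCloseMatch are hypothetical, and reading "90% match" as "at most 10% of the signature length in edits" is one possible interpretation, not necessarily the one intended above.

```csharp
using System;

static class EditDistance
{
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap the two rows so prev always holds the last completed row.
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Hypothetical helper: accepts the candidate when the number of edits is at most
    // (1 - minSimilarity) of the signature length. decimal avoids rounding surprises.
    public static bool IsCloseMatch(string signature, string candidate, decimal minSimilarity)
    {
        int maxEdits = (int)(signature.Length * (1m - minSimilarity));
        return Levenshtein(signature, candidate) <= maxEdits;
    }
}
```

For the file names in the question, EditDistance.Levenshtein("test.txt", "notatest.txt") is 4, since four characters were inserted. Note that this compares two whole strings; it does not by itself locate the signature inside a larger program.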

Answer 2

First, your program bits should match EXACTLY, or else the program has been modified or is corrupt. Generally, you will store an MD5 hash of the original binary and check the MD5 of new versions against it to see if they are 'the same enough' (hash collisions mean MD5 can't guarantee a 100% match).
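
A minimal sketch of that hash check, assuming the known-good digest is stored as a hex string. FileHash, Md5Hex and IsUnmodified are hypothetical names; the MD5 class itself is from System.Security.Cryptography.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHash
{
    // Returns the MD5 digest of a byte buffer as lowercase hex.
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the file's current digest equals the stored one.
    // Reads the whole file into memory, which is fine for a sketch but not for huge files.
    public static bool IsUnmodified(string path, string knownMd5Hex)
    {
        string actual = Md5Hex(File.ReadAllBytes(path));
        return string.Equals(actual, knownMd5Hex, StringComparison.OrdinalIgnoreCase);
    }
}
```

A hash comparison like this only answers "identical or not"; it gives no notion of how similar two files are.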

Beyond this, in order to detect malware in an arbitrary binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code containing some binary sequence XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time such a virus runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
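
The simplest case, looking for an exact byte sequence XYZ inside a file's bytes, could be sketched as follows. PatternScan and IndexOf are hypothetical names, and this is a naive scan rather than an optimized search.

```csharp
using System;

static class PatternScan
{
    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            // Advance while the bytes at this offset keep matching the pattern.
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

It might be called as PatternScan.IndexOf(File.ReadAllBytes(path), signatureBytes), where path and signatureBytes are placeholders for the file to scan and the known pattern.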

In terms of finding a % match, this operation is very time-consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match, or less since content has been added? What about 'ABCDABCD', where it matches twice? How about 'AXBXCXD'? What about 'CDAB'?

There are many DIFF tools in existence that can tell you which pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.

Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
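
As one example of such a restriction, suppose a match must be a single contiguous stretch of the program that is within a fixed number of edits of the signature. Under that assumption, a standard dynamic-programming variant of edit distance (often attributed to Sellers) finds the best stretch in time proportional to the signature length times the program length. The sketch below is illustrative only; FuzzySubstring, BestDistance and ContainsApprox are hypothetical names.

```csharp
using System;

static class FuzzySubstring
{
    // Smallest edit distance between pattern and ANY contiguous substring of text.
    // Same recurrence as Levenshtein, except the cost of starting a match is zero
    // at every position of text, and the result is the minimum over all end positions.
    public static int BestDistance(string pattern, string text)
    {
        int m = pattern.Length;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) prev[i] = i;   // pattern prefix vs. empty substring
        int best = prev[m];

        foreach (char c in text)
        {
            curr[0] = 0;                            // a match may start anywhere in text
            for (int i = 1; i <= m; i++)
            {
                int cost = pattern[i - 1] == c ? 0 : 1;
                curr[i] = Math.Min(Math.Min(prev[i] + 1, curr[i - 1] + 1), prev[i - 1] + cost);
            }
            best = Math.Min(best, curr[m]);         // best match ending at this character
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }

    // Hypothetical helper: "minSimilarity" of 0.90m allows edits up to 10% of the pattern length.
    public static bool ContainsApprox(string pattern, string text, decimal minSimilarity)
    {
        int maxEdits = (int)(pattern.Length * (1m - minSimilarity));
        return BestDistance(pattern, text) <= maxEdits;
    }
}
```

With a 10-character signature and a 90% requirement, one edit is tolerated: FuzzySubstring.ContainsApprox("ABCDEFGHIJ", "zzzABCDEXGHIJzzz", 0.90m) is true, while the same call against "zzzABCXEXGHIJzzz" is false. Chunks that are reordered or scattered through the file are outside what this restriction covers.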

Answer 3

I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.

The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter also counts swapping two adjacent characters as a single operation.

I use these quite a lot when writing programs where users can search for things but may not know the exact spelling.

There are code examples in both articles.
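
For readers without the articles at hand, here is a minimal sketch of the transposition-aware distance. It implements the restricted form usually called optimal string alignment, which is a common simplification of Damerau-Levenshtein rather than the full algorithm; Transpositions and OsaDistance are hypothetical names.

```csharp
using System;

static class Transpositions
{
    // Optimal string alignment distance (restricted Damerau-Levenshtein):
    // insert, delete, substitute, or swap two adjacent characters, each costing 1.
    public static int OsaDistance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);

                // Adjacent transposition: the last two characters of each prefix are swapped.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    best = Math.Min(best, d[i - 2, j - 2] + 1);

                d[i, j] = best;
            }
        }
        return d[a.Length, b.Length];
    }
}
```

Transpositions.OsaDistance("AB", "BA") is 1, whereas plain Levenshtein distance for the same pair is 2.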

3 answers

Answer 1
1, accepted, 2012-08-20 20:57:29

Answer 2
1, 2012-08-20 21:17:04

Answer 3
0, 2012-08-21 05:02:51
