简体   繁体   中英

Searching for partial substring within string in C#

Okay so I'm trying to make a basic malware scanner in C# my question is say I have the Hex signature for a particular bit of code

For example

        {
            System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
        }

        //Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b

Gets Changed to -

        {
            System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
        }
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b

Keep in mind these bits will be within the entire Hex of the program - How could I go about taking my base signature and looking for partial matches that say have a 90% match therefore gets flagged.

I would do a wildcard but that wouldn't work for slightly more complex things where it might be coded slightly different but the majority would be the same. So is there a way I can do a percent match for a substring? I was looking into the Levenshtein Distance but I don't see how I'd apply it into this given scenario.

Thanks in advance for any input

Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.

For example, you may statically set that if the distance is less than five edits, the change is relevant.

You could also take the length of string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threashold.

First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).

Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chuncks. What is more interesting is that some viruses are self-morphing. This means that each time it runs, it modifies itself, meaning the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives can be produced and look for all of them.

In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?

There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.

Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).

I suggest you look into Levenshtein distance and Damerau-Levenshtein distance .

The former tells you how many add/delete operations are needed to turn one string into another; and the latter tells you how many add/delete/replace operations are needed to turn one string into another.

I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.

There are code examples on both articles.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM