简体   繁体   中英

Knuth Morris Pratt search algorithm for search & replace on a stream

I want to implement a search and replace algorithm on a stream, where I would create a stream class that would take a source stream as input, and who's read method would read from the source stream as needed and perform a search/replace as the stream is being read. This would mitigate the entire set of data from being loaded into memory at once and enable search & replace on very large data sets.

To do this I thought I would start with an existing search algorithm, adapt it to a stream approach, then adapt the replace functionality.

Below is a Knuth Morris Pratt implementation I adapted from an online example. Has anybody adapted something like this to a stream approach? I would need to take into account searching across read boundaries, which I'm not sure how I would do yet.

/// <summary>
/// The KMP matching algorithm uses degenerating property (SearchPattern having same sub-SearchPatterns appearing more than once in the SearchPattern) 
/// of the SearchPattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we 
/// detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take 
/// advantage of this information to avoid matching the characters that we know will anyway match. 
/// </summary>
/// <seealso cref="https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/"/>
public class KMPSearch
{
    /// <summary>
    /// Pattern we are looking for
    /// </summary>
    readonly string sSearchPattern_m;

    /// <summary>
    /// Text we are looking in
    /// </summary>
    readonly string sData_m;

    /// <summary>
    /// Index for text
    /// </summary>
    private int iDataPosition_m;
    private int iDataLength_m;

    /// <summary>
    /// Index for search pattern
    /// </summary>
    private int iSearchPatternPosition_m;

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// </summary>
    readonly int[] lstLongestProperPrefix_m;

    public KMPSearch(string sSearchPattern, string sData)
    {
        this.sSearchPattern_m = sSearchPattern;
        this.sData_m = sData;
        this.iDataLength_m = sData.Length;

        // create lps that will hold the longest prefix suffix values for SearchPattern             
        this.lstLongestProperPrefix_m = new int[sSearchPattern.Length];

        // Pre-process the SearchPattern (calculate lps array) 
        this.ComputeLPSArray();

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat
    }

    /// <summary>
    /// Find next match
    /// </summary>
    /// <returns></returns>
    public int Next()
    {
        int iMatchIndex = -1;

        //We start comparison of pat[iSearchPatternPosition_m] with iSearchPatternPosition_m = 0 with characters of current window of text.
        //We keep matching characters txt[iDataPosition_m] and pat[iSearchPatternPosition_m] and keep incrementing iDataPosition_m and iSearchPatternPosition_m while 
        //pat[iSearchPatternPosition_m] and txt[iDataPosition_m] keep matching.
        while (iDataPosition_m < this.iDataLength_m)
        {
            if (this.sSearchPattern_m[iSearchPatternPosition_m] == this.sData_m[iDataPosition_m])
            {
                iSearchPatternPosition_m++;
                iDataPosition_m++;
            }

            if (iSearchPatternPosition_m == sSearchPattern_m.Length)
            {
                //Console.WriteLine("Found SearchPattern at index %d ", iDataPosition_m - iSearchPatternPosition_m);
                iMatchIndex = iDataPosition_m - iSearchPatternPosition_m;

                iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
            }

            // mismatch after j matches 
            else if (iDataPosition_m < this.iDataLength_m && this.sSearchPattern_m[iSearchPatternPosition_m] != this.sData_m[iDataPosition_m])
            {
                //When we see a mismatch
                //* We know that characters pat[0..iSearchPatternPosition_m - 1] match with txt[iDataPosition_m-iSearchPatternPosition_m..iDataPosition_m - 1] 
                //  (Note that iSearchPatternPosition_m starts with 0 and increment it only when there is a match).
                //* We also know (from above definition) that lps[iSearchPatternPosition_m - 1] is count of characters of pat[0…iSearchPatternPosition_m - 1] 
                //  that are both proper prefix and suffix.
                //* From above two points, we can conclude that we do not need to match these lps[iSearchPatternPosition_m - 1] characters with 
                //  txt[iDataPosition_m  -iSearchPatternPosition_m..iDataPosition_m - 1] because we know that 
                //  these characters will anyway match. Let us consider above example to understand this.

                // Do not match lps[0..lps[iSearchPatternPosition_m - 1]] characters, 
                // they will match anyway 
                if (iSearchPatternPosition_m != 0)
                {
                    iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
                }
                else
                {
                    iDataPosition_m = iDataPosition_m + 1;
                }
            }

            if (iMatchIndex > -1)
            {
                return iMatchIndex;
            }
        }
        return iMatchIndex;
    }

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// Fills lps for given pattern pat[0..M-1] 
    /// lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i]. 
    /// </summary>
    private void ComputeLPSArray()
    {
        // length of the previous longest prefix suffix 
        int len = 0;

        this.lstLongestProperPrefix_m[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1;
        while (i < this.sSearchPattern_m.Length)
        {
            if (this.sSearchPattern_m[i] == this.sSearchPattern_m[len])
            {
                len++;
                this.lstLongestProperPrefix_m[i] = len;
                i++;
            }
            else // (pat[i] != pat[len]) 
            {
                // This is tricky. Consider the example. 
                // AAACAAAA and i = 7. The idea is similar 
                // to search step. 
                if (len != 0)
                {
                    len = this.lstLongestProperPrefix_m[len - 1];

                    // Also, note that we do not increment 
                    // i here 
                }
                else // if (len == 0) 
                {
                    this.lstLongestProperPrefix_m[i] = 0;
                    i++;
                }
            }
        }
    }
}

Then I took the above algorithm and retrofitted it into a Stream. As a proof of concept, the stream will raise an event during it's read method every time it finds the search pattern. It currently has a limitation of not being able to search across the read boundaries. So if 1024 bytes are read at a time, and the length of the source stream is 2048 bytes, two reads are executed to read the entire stream. The issue is that if the search pattern starts at index 1000 and is 40 bytes long, it will not be found. I think once this issue is solved, the actual replace functionality won't be that difficult. I'm looking for suggestions on how to implement the search across read boundaries. It probably involves caching part of the previous read. Is anybody aware of a streaming implementation similar to this or suggestions?

public class KMPSearchStream : Stream
{
    public override bool CanRead { get { return true; } }

    public override bool CanSeek => throw new NotImplementedException();

    public override bool CanWrite => throw new NotSupportedException();

    public override long Length => throw new NotSupportedException();

    public override long Position { get => throw new NotImplementedException(); set => throw new NotImplementedException(); }

    public class PatternFoundEventArgs
    {
        public int Index { get; internal set; }
    }

    public delegate void PatternFoundEvent(PatternFoundEventArgs e);

    private Stream strSource_m;

    /// <summary>
    /// Pattern we are looking for
    /// </summary>
    private byte[] searchPattern_m;

    /// <summary>
    /// Text we are looking in
    /// </summary>
    private byte[] data_m;

    /// <summary>
    /// Index for text
    /// </summary>
    private int iDataPosition_m;
    private int iDataLength_m;

    /// <summary>
    /// Index for search pattern
    /// </summary>
    private int iSearchPatternPosition_m;

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// </summary>
    readonly int[] lstLongestProperPrefix_m;

    public KMPSearchStream(Stream strSource, byte[] searchPattern)
    {
        if (strSource == null)
        {
            throw new ArgumentNullException(nameof(strSource), "Source stream is null.");
        }
        if (searchPattern == null || searchPattern.Length == 0)
        {
            throw new ArgumentNullException(nameof(searchPattern), "Pattern to find is null or empty.");
        }

        this.strSource_m = strSource;
        this.searchPattern_m = searchPattern;

        // create lps that will hold the longest prefix suffix values for SearchPattern             
        this.lstLongestProperPrefix_m = new int[searchPattern.Length];

        // Pre-process the SearchPattern (calculate lps array) 
        this.ComputeLPSArray();

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat
    }

    public event PatternFoundEvent OnPatternFound;

    public override void Flush()
    {
        throw new NotSupportedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int iRead = this.strSource_m.Read(buffer, offset, count);

        this.iDataPosition_m = 0; // index for txt 
        this.iSearchPatternPosition_m = 0; // index for pat

        this.data_m = buffer;
        this.iDataPosition_m = offset;
        this.iDataLength_m = iRead;

        int iIndex;

        while ((iIndex = this.Next()) > -1)
        {
            this.OnPatternFound(new PatternFoundEventArgs()
            {
                Index = iIndex
            });
        }
        return iRead;
    }

    /// <summary>
    /// A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. 
    /// Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
    /// Fills lps for given pattern pat[0..M-1] 
    /// lps[i] = the longest proper prefix of pat[0..i] which is also a suffix of pat[0..i]. 
    /// </summary>
    private void ComputeLPSArray()
    {
        // length of the previous longest prefix suffix 
        int len = 0;

        this.lstLongestProperPrefix_m[0] = 0; // lps[0] is always 0 

        // the loop calculates lps[i] for i = 1 to M-1 
        int i = 1;
        while (i < this.searchPattern_m.Length)
        {
            if (this.searchPattern_m[i] == this.searchPattern_m[len])
            {
                len++;
                this.lstLongestProperPrefix_m[i] = len;
                i++;
            }
            else // (pat[i] != pat[len]) 
            {
                // This is tricky. Consider the example. 
                // AAACAAAA and i = 7. The idea is similar 
                // to search step. 
                if (len != 0)
                {
                    len = this.lstLongestProperPrefix_m[len - 1];

                    // Also, note that we do not increment 
                    // i here 
                }
                else // if (len == 0) 
                {
                    this.lstLongestProperPrefix_m[i] = 0;
                    i++;
                }
            }
        }
    }

    /// <summary>
    /// Find next match
    /// </summary>
    /// <returns></returns>
    public int Next()
    {
        int iMatchIndex = -1;

        //We start comparison of pat[iSearchPatternPosition_m] with iSearchPatternPosition_m = 0 with characters of current window of text.
        //We keep matching characters txt[iDataPosition_m] and pat[iSearchPatternPosition_m] and keep incrementing iDataPosition_m and iSearchPatternPosition_m while 
        //pat[iSearchPatternPosition_m] and txt[iDataPosition_m] keep matching.
        while (iDataPosition_m < this.iDataLength_m)
        {
            if (this.searchPattern_m[iSearchPatternPosition_m] == this.data_m[iDataPosition_m])
            {
                iSearchPatternPosition_m++;
                iDataPosition_m++;
            }

            if (iSearchPatternPosition_m == searchPattern_m.Length)
            {
                //Console.WriteLine("Found SearchPattern at index %d ", iDataPosition_m - iSearchPatternPosition_m);
                iMatchIndex = iDataPosition_m - iSearchPatternPosition_m;

                iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
            }

            // mismatch after j matches 
            else if (iDataPosition_m < this.iDataLength_m && this.searchPattern_m[iSearchPatternPosition_m] != this.data_m[iDataPosition_m])
            {
                //When we see a mismatch
                //* We know that characters pat[0..iSearchPatternPosition_m - 1] match with txt[iDataPosition_m-iSearchPatternPosition_m..iDataPosition_m - 1] 
                //  (Note that iSearchPatternPosition_m starts with 0 and increment it only when there is a match).
                //* We also know (from above definition) that lps[iSearchPatternPosition_m - 1] is count of characters of pat[0…iSearchPatternPosition_m - 1] 
                //  that are both proper prefix and suffix.
                //* From above two points, we can conclude that we do not need to match these lps[iSearchPatternPosition_m - 1] characters with 
                //  txt[iDataPosition_m  -iSearchPatternPosition_m..iDataPosition_m - 1] because we know that 
                //  these characters will anyway match. Let us consider above example to understand this.

                // Do not match lps[0..lps[iSearchPatternPosition_m - 1]] characters, 
                // they will match anyway 
                if (iSearchPatternPosition_m != 0)
                {
                    iSearchPatternPosition_m = this.lstLongestProperPrefix_m[iSearchPatternPosition_m - 1];
                }
                else
                {
                    iDataPosition_m = iDataPosition_m + 1;
                }
            }

            if (iMatchIndex > -1)
            {
                return iMatchIndex;
            }
        }
        return iMatchIndex;
    }

}

If I understand you correctly, the goal is to scan an ongoing stream and replace all occurences of a mask with a replacement value before forwarding it to an output stream.

If so, there are a few requirements to the system:

  • Only data may be forwarded for which can be proven that it is not part of a complete mask
  • Data must be forwarded as soon as possible to keep congestion to a minimum

Abstract:
In essence, you buffer all the data before you forward it.
After each character, you test whether the buffer could be part of the mask.
If it couldn't be part of the mask, you release (write to output) one character after the other from the buffer until either the buffer is empty, or the buffer could match the mask again.

Code:
I made a helper class that does essentially that.
I don't know whether it implements your algorithm, this is just something I cooked up.
According to my tests, it works, but I can't guarantee anything.
You can use it in your stream class and just forward the calls and recive the callbacks.
Forward input data via the Write([...]) methods, get output via the output callback and define the mask and replacement string. It supports flushing and reseting.

using System;
using System.Collections.Generic;

/// <summary>
/// Class that handles replacement of string occurrences that match a mask in a string stream. 
/// </summary>
public class StreamReplacement
{
    private Action<char> output;

    private string mask;

    private string replacement;

    // Buffer for unverified characters
    private readonly List<char> buffer;

    // Current index on the mask for the next character added
    private int maskPosition;

    /// <summary>
    /// All verified characters will be passed to the output
    /// </summary>
    /// <exception cref="System.ArgumentNullException">The output cannot be null.</exception>
    public Action<char> Output
    {
        get => output;
        set => output = value ?? throw new ArgumentNullException();
    }

    /// <summary>
    /// Gets or sets the mask to test for.
    /// </summary>
    /// <exception cref="System.ArgumentException">The mask must contain at least one character.</exception>
    public string Mask
    {
        get => this.mask;
        set
        {
            if (string.IsNullOrEmpty(mask))
            {
                throw new ArgumentException("The mask must contain at least one character.");
            }

            this.mask = value;

            // Clean buffer and throw out characters that can't be part of the mask anymore.
            WriteVerifiedCharactersFromBuffer();
        }
    }

    /// <summary>
    /// Gets or sets the replacement string to output when the mask has been encountered.
    /// </summary>
    public string Replacement
    {
        get => replacement;
        set => replacement = value ?? string.Empty;
    }

    /// <summary>
    /// Initializes a new instance of the <see cref="StreamReplacement"/> class.
    /// </summary>
    /// <param name="output">The output.</param>
    /// <param name="mask">The mask.</param>
    /// <param name="replacement">The replacement string for mask encounters.</param>
    /// <exception cref="System.ArgumentNullException">Output cannot be null.</exception>
    /// <exception cref="System.ArgumentException">The mask must contain at least one character.</exception>
    public StreamReplacement(Action<char> output, string mask, string replacement)
    {
        this.output = output ?? throw new ArgumentNullException(nameof(output));
        this.mask = string.IsNullOrEmpty(mask) ? throw new ArgumentException("The mask must contain at least one character.") : mask;
        this.replacement = replacement ?? string.Empty;
        this.buffer = new List<char>(mask.Length);
    }

    /// <summary>
    /// Writes a single character to the stream.
    /// </summary>
    public void Write(char c)
    {
        WriteCharacter(c);
    }

    /// <summary>
    /// Writes the specified text to the stream.
    /// </summary>
    public void Write(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return;
        }

        foreach (var c in text)
        {
            WriteCharacter(c);
        }
    }

    /// <summary>
    /// Writes the specified characters to the stream.
    /// </summary>
    /// <param name="characters">The characters.</param>
    public void Write(params char[] characters)
    {
        if (characters == null)
        {
            return;
        }

        foreach (var c in characters)
        {
            WriteCharacter(c);
        }
    }

    /// <summary>
    /// Flushes all buffered characters, even if they could be part of the mask.<br/>
    /// Starts from scratch after flushing.
    /// </summary>
    public void Flush()
    {
        foreach (var c in buffer)
        {
            output(c);
        }

        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Clears the buffer without writing any buffered data to the output stream.
    /// </summary>
    public void Reset()
    {
        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Writes the character either to the buffer or output stream and handles masking.
    /// </summary>
    private void WriteCharacter(char c)
    {
        if (mask[maskPosition] == c)
        {
            AddUnverifiedCharacter(c);
        }
        else
        {
            buffer.Add(c);
            WriteVerifiedCharactersFromBuffer();
        }
    }

    /// <summary>
    /// Stores a character in the buffer that could be part of the mask.
    /// </summary>
    private void AddUnverifiedCharacter(char c)
    {
        buffer.Add(c);
        maskPosition++;

        if (maskPosition == mask.Length)
        {
            buffer.Clear();
            maskPosition = 0;

            foreach (var repChar in replacement)
            {
                output(repChar);
            }
        }
    }

    /// <summary>
    /// Writes characters that cannot be part of the mask to the output.
    /// </summary>
    private void WriteVerifiedCharactersFromBuffer()
    {
        // Find possible new start position in buffer
        var newStart = 0;

        for (; newStart < buffer.Count; newStart++)
        {
            if (IsBufferPartOfMask(newStart))
            {
                WritePartOfBuffer(newStart);
                buffer.RemoveRange(0, newStart);
                maskPosition = buffer.Count;
                return;
            }
        }

        WritePartOfBuffer(buffer.Count);
        buffer.Clear();
        maskPosition = 0;
    }

    /// <summary>
    /// Determines whether the buffer content can be part of the mask, starting at the given index.
    /// </summary>
    private bool IsBufferPartOfMask(int offset)
    {
        for (var i = 0; i < buffer.Count - offset; i++)
        {
            var c = buffer[offset + i];
            if (c != mask[i])
            {
                return false;
            }
        }

        return true;
    }

    /// <summary>
    /// Writes the beginning part of buffer to the output
    /// </summary>
    private void WritePartOfBuffer(int count)
    {
        for (var i = 0; i < count; i++)
        {
            output(buffer[i]);
        }
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM