Equality Comparison with RTF Strings

Question

I have a program that snags copied data and stores it for later use. Items that are equal or at least equivalent are not supposed to be added again to the list. A problem occurs with rich text strings.

For my purposes, the strings should be considered equal if they have the same plain-text result and the same formatting. Correct me if I'm wrong, but I understand that an embedded RSID is created when an RTF string is copied, and that it is different for each copied RTF string. I am currently removing all RSIDs with Regex.

However, the same one-word string copied twice from Microsoft Word gives me two RTF strings that are considered unequal, even when I strip them of their RSIDs.

Using C#, how can I compare these strings by their plain-text content and formatting only?

My function currently looks like this:

private bool HasEquivalentRichText(string richText1, string richText2)
{
    var rsidRegex = new Regex("(rsid[0-9]+)");
    var cleanText1 = rsidRegex.Replace(richText1, string.Empty);
    var cleanText2 = rsidRegex.Replace(richText2, string.Empty);

    return cleanText1.Equals(cleanText2);
}

Answer 1

When Word converts a Word document to RTF, it attempts to capture the original document with complete fidelity by including a variety of proprietary tokens. One of these is {\*\datastore, and it seems that, for whatever reason, something inside the datastore (perhaps a copy counter?) gets modified after each copy operation. (This datastore is reported to be encrypted binary data and its internals seem to be undocumented, so I cannot tell exactly why it changes after each copy.)

As long as you don't need to paste the data back into Word, you can strip this proprietary information as well as the rsid group:

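    /// <summary>
    /// Remove every group from the incoming RTF string that starts with {\groupBeginningControlWord
    /// </summary>
    /// <param name="rtf">The RTF string to clean.</param>
    /// <param name="groupBeginningControlWord">The text that follows the opening "{\", e.g. *\datastore.</param>
    /// <returns>The RTF string with all matching groups removed.</returns>
    static string RemoveRtfGroup(string rtf, string groupBeginningControlWord)
    {
        // see http://www.biblioscape.com/rtf15_spec.htm
        string groupBeginning = "{\\" + groupBeginningControlWord;
        int index;
        while ((index = rtf.IndexOf(groupBeginning, StringComparison.Ordinal)) >= 0)
        {
            // Scan forward for the brace that closes the group, tracking nesting depth.
            // Note: escaped braces (\{ and \}) and \binN data are not handled here.
            int depth = 1;
            int nextIndex = index + groupBeginning.Length;
            for (; depth > 0 && nextIndex < rtf.Length; nextIndex++)
            {
                if (rtf[nextIndex] == '}')
                    depth--;
                else if (rtf[nextIndex] == '{')
                    depth++;
            }
            if (depth > 0)
                break; // Unterminated group: stop rather than loop forever.
            // nextIndex is now one past the closing brace.
            rtf = rtf.Remove(index, nextIndex - index);
        }

        return rtf;
    }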

    static string CleanNonFormattingFromRtf(string rtf)
    {
        var rsidRegex = new Regex("(rsid[0-9]+)");

        var cleanText = rsidRegex.Replace(rtf, string.Empty);
        cleanText = RemoveRtfGroup(cleanText, @"*\datastore");
        return cleanText;
    }

This seems to work in a simple test case of copying a single word from a Word document twice.
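
One caveat with the brace counting above: it also counts escaped braces (\{ and \}) that appear as literal text inside a group. Below is a minimal sketch of a variant that skips escaped characters. The names RtfGroupStripper, RemoveGroup and FindGroupEnd are hypothetical, not part of the code above, and the sketch still assumes the group contains no \binN binary data and that the opening brace itself is not escaped.

```csharp
using System;
using System.Text;

public static class RtfGroupStripper
{
    // Hypothetical helper: removes every group that starts with "{" + groupStart.
    // Assumption: the group contains no \binN binary payload.
    public static string RemoveGroup(string rtf, string groupStart)
    {
        string opening = "{" + groupStart;
        var result = new StringBuilder(rtf.Length);
        int pos = 0;
        while (pos < rtf.Length)
        {
            int start = rtf.IndexOf(opening, pos, StringComparison.Ordinal);
            if (start < 0)
                break;
            int end = FindGroupEnd(rtf, start);
            if (end < 0)
                break; // Unbalanced group: keep the remainder unchanged.
            result.Append(rtf, pos, start - pos); // Copy the text before the group.
            pos = end + 1;                        // Resume after the closing brace.
        }
        result.Append(rtf, pos, rtf.Length - pos);
        return result.ToString();
    }

    // Returns the index of the '}' that closes the group opened at openIndex, or -1.
    static int FindGroupEnd(string rtf, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < rtf.Length; i++)
        {
            char ch = rtf[i];
            if (ch == '\\')
                i++; // Skip the next character: an escaped \{, \}, \\ or the start of a control word.
            else if (ch == '{')
                depth++;
            else if (ch == '}' && --depth == 0)
                return i;
        }
        return -1;
    }
}
```

For example, RtfGroupStripper.RemoveGroup(rtf, @"\*\datastore") would take the place of RemoveRtfGroup(rtf, @"*\datastore"); note that this sketch expects the leading backslash to be part of the argument.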

Update

After some further investigation, it seems that you may not be able to reliably determine equality of RTF strings copied from Word simply by excising undesired metadata and comparing the results.

You didn't provide a minimal, complete and verifiable example of a Word doc which generates different RTF for identical copy-buffer operations, so I used a page from the Microsoft RTF spec as test content.

Given this, I first found it was necessary to remove the entire {\*\rsidtbl group:

    static string CleanNonFormattingFromRtf(string rtf)
    {
        var rsidRegex = new Regex("(rsid[0-9]+)");

        var cleanText = rtf;
        cleanText = RemoveRtfGroup(cleanText, @"*\datastore");
        cleanText = RemoveRtfGroup(cleanText, @"*\rsidtbl");
        cleanText = rsidRegex.Replace(cleanText, string.Empty);
        return cleanText;
    }
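
Note that the pattern (rsid[0-9]+) removes only the "rsid" text and its number, so \insrsid123 becomes \ins, and a literal "rsid123" in the document text is removed as well. Both inputs are altered the same way, so the comparison still works, but a stricter pattern may be easier to reason about. The following is a sketch only: RsidStripper and its pattern are my own assumptions, not part of the code above. It removes whole control words whose name contains "rsid" and which carry a numeric parameter, and it deliberately leaves any trailing space delimiter in place so that the preceding control word cannot merge with the following text. The result is meant for comparison, not for pasting back into Word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RsidStripper
{
    // Backslash, optional lowercase prefix, "rsid", optional lowercase suffix, digits.
    // Assumption: covers names such as \rsid, \insrsid, \pararsid, \charrsid, \rsidroot.
    static readonly Regex RsidControlWord =
        new Regex(@"\\[a-z]*rsid[a-z]*[0-9]+", RegexOptions.CultureInvariant);

    public static string Strip(string rtf)
    {
        return RsidControlWord.Replace(rtf, string.Empty);
    }
}
```

A limitation of this sketch is that it does not recognize an escaped backslash (\\) followed by text that happens to look like an rsid control word; handling that reliably needs a tokenizer.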

Secondly, I found that Word introduces cosmetic CRLFs into the RTF for readability roughly every 255 characters. These are generally ignored when parsing the document; however, changes to the rsidtbl can cause these line breaks to be inserted at different locations. Thus it's necessary to remove such cosmetic breaks, but not all line breaks in RTF are cosmetic: those in binary sections, and those serving as delimiters for control words, must be retained. So an elementary tokenizer and parser is needed just to strip the unnecessary line breaks, e.g.:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Globalization;

public class RtfNormalizer
{
    public RtfNormalizer(string rtf)
    {
        if (rtf == null)
            throw new ArgumentNullException();
        Rtf = rtf;
    }

    public string Rtf { get; private set; }

    public string GetNormalizedString()
    {
        StringBuilder sb = new StringBuilder();
        var tokenizer = new RtfTokenizer(Rtf);

        RtfToken previous = RtfToken.None;
        while (tokenizer.MoveNext())
        {
            previous = AddCurrentToken(tokenizer, sb, previous);
        }

        return sb.ToString();
    }

    private RtfToken AddCurrentToken(RtfTokenizer tokenizer, StringBuilder sb, RtfToken previous)
    {
        var token = tokenizer.Current;
        switch (token.Type)
        {
            case RtfTokenType.None:
                break;
            case RtfTokenType.StartGroup:
                AddPushGroup(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.EndGroup:
                AddPopGroup(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.ControlWord:
                AddControlWord(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.ControlSymbol:
                AddControlSymbol(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.IgnoredDelimiter:
                AddIgnoredDelimiter(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.CRLF:
                AddCarriageReturn(tokenizer, token, sb, previous);
                break;
            case RtfTokenType.Content:
                AddContent(tokenizer, token, sb, previous);
                break;
            default:
                Debug.Assert(false, "Unknown token type " + token.ToString());
                break;
        }
        return token;
    }

    private void AddPushGroup(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    private void AddPopGroup(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    const string binPrefix = @"\bin";

    bool IsBinaryToken(RtfToken token, out int binaryLength)
    {
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 209:
        //      Remember that binary data can occur when you’re skipping RTF.
        //      A simple way to skip a group in RTF is to keep a running count of the opening braces the RTF reader 
        //      has encountered in the RTF stream. When the RTF reader sees an opening brace, it increments the count. 
        //      When the reader sees a closing brace, it decrements the count. When the count becomes negative, the end 
        //      of the group was found. Unfortunately, this does not work when the RTF file contains a \binN control; the 
        //      reader must explicitly check each control word found to see if it is a \binN control, and if found, 
        //      skip that many bytes before resuming its scanning for braces.
        if (string.CompareOrdinal(binPrefix, 0, token.Rtf, token.StartIndex, binPrefix.Length) == 0)
        {
            // The second argument is an offset relative to the start of the token, not an absolute index.
            if (RtfTokenizer.IsControlWordNumericParameter(token, binPrefix.Length))
            {
                bool ok = int.TryParse(token.Rtf.Substring(token.StartIndex + binPrefix.Length, token.Length - binPrefix.Length),
                    NumberStyles.Integer, CultureInfo.InvariantCulture, 
                    out binaryLength);
                return ok;
            }
        }
        binaryLength = -1;
        return false;
    }

    private void AddControlWord(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // Carriage return, usually ignored.
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 151:
        // RTF writers should not use the carriage return/line feed (CR/LF) combination to break up pictures 
        // in binary format. If they do, the CR/LF combination is treated as literal text and considered part of the picture data.
        AddContent(tokenizer, token, sb, previous);
        int binaryLength;
        if (IsBinaryToken(token, out binaryLength))
        {
            if (tokenizer.MoveFixedLength(binaryLength))
            {
                AddContent(tokenizer, tokenizer.Current, sb, previous);
            }
        }
    }

    private void AddControlSymbol(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        AddContent(tokenizer, token, sb, previous);
    }

    private static bool? CanMergeToControlWord(RtfToken previous, RtfToken next)
    {
        if (previous.Type != RtfTokenType.ControlWord)
            throw new ArgumentException();
        if (next.Type == RtfTokenType.CRLF)
            return null; // Can't tell
        if (next.Type != RtfTokenType.Content)
            return false;
        if (previous.Length < 2)
            return false; // Internal error?
        if (next.Length < 1)
            return null; // Internal error?
        var lastCh = previous.Rtf[previous.StartIndex + previous.Length - 1];
        var nextCh = next.Rtf[next.StartIndex];
        if (RtfTokenizer.IsAsciiLetter(lastCh))
        {
            return RtfTokenizer.IsAsciiLetter(nextCh) || RtfTokenizer.IsAsciiMinus(nextCh) || RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else if (RtfTokenizer.IsAsciiMinus(lastCh))
        {
            return RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else if (RtfTokenizer.IsAsciiDigit(lastCh))
        {
            return RtfTokenizer.IsAsciiDigit(nextCh);
        }
        else
        {
            Debug.Assert(false, "unknown final character for control word token \"" + previous.ToString() + "\"");
            return false;
        }
    }

    bool IgnoredDelimiterIsRequired(RtfTokenizer tokenizer, RtfToken token, RtfToken previous)
    {
        // Word inserts required delimiters when required, and optional delimiters for beautification 
        // and readability.  Strip the optional delimiters while retaining the required ones.
        if (previous.Type != RtfTokenType.ControlWord)
            return false;
        var current = tokenizer.Current;
        try
        {
            while (tokenizer.MoveNext())
            {
                var next = tokenizer.Current;
                var canMerge = CanMergeToControlWord(previous, next);
                if (canMerge == null)
                    continue;
                return canMerge.Value;
            }
        }
        finally
        {
            tokenizer.MoveTo(current);
        }
        return false;
    }

    private void AddIgnoredDelimiter(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // Rich Text Format (RTF) Specification, Version 1.9.1, p 151:
        // an RTF file does not have to contain any carriage return/line feed pairs (CRLFs) and CRLFs should be ignored by RTF readers except that 
        // they can act as control word delimiters. RTF files are more readable when CRLFs occur at major group boundaries.
        //
        // but then later:
        // 
        // If a single space delimits the control word, the space does not appear in the document (it’s ignored). Any characters following the single space delimiter, including any subsequent spaces, 
        // will appear as text or spaces in the document. For this reason, you should use spaces only where necessary. It is recommended to avoid spaces as a means of breaking up RTF syntax to make 
        // it easier to read. You can use paragraph marks (CR, LF, or CRLF) to break up lines without changing the meaning except in destinations that contain \binN. 
        // In this document, a control word that takes a numeric parameter N is written with the N, as shown here for \binN, unless the control word appears with an explicit value. The only exceptions to 
        // this are “toggle” control words like \b (bold toggle), which have only two states. When such a control word has no parameter or has a nonzero parameter, the control word turns the property on. 
        // When such a control word has a parameter of 0, the control word turns the property off. For example, \b turns on bold and \b0 turns off bold. In the definitions of these toggle control words, 
        // the control word names are followed by an asterisk.
        if (IgnoredDelimiterIsRequired(tokenizer, token, previous))
            // There *May* be a need for a delimiter, 
            AddContent(tokenizer, " ", sb, previous);
    }

    private void AddCarriageReturn(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        // DO NOTHING.
    }

    private void AddContent(RtfTokenizer tokenizer, RtfToken token, StringBuilder sb, RtfToken previous)
    {
        sb.Append(token.ToString());
    }

    private void AddContent(RtfTokenizer tokenizer, string content, StringBuilder sb, RtfToken previous)
    {
        sb.Append(content);
    }
}

And

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

public enum RtfTokenType
{
    None = 0,
    StartGroup,
    EndGroup,
    CRLF,
    ControlWord,
    ControlSymbol,
    IgnoredDelimiter,
    Content,
}

public struct RtfToken : IEquatable<RtfToken>
{
    public static RtfToken None { get { return new RtfToken(); } }

    public RtfToken(RtfTokenType type, int startIndex, int length, string rtf)
        : this()
    {
        this.Type = type;
        this.StartIndex = startIndex;
        this.Length = length;
        this.Rtf = rtf;
    }
    public RtfTokenType Type { get; private set; }

    public int StartIndex { get; private set; }

    public int Length { get; private set; }

    public string Rtf { get; private set; }

    public bool IsEmpty { get { return Rtf == null; } }

    #region IEquatable<RtfToken> Members

    public bool Equals(RtfToken other)
    {
        if (this.Type != other.Type)
            return false;
        if (this.Length != other.Length)
            return false;
        if (this.IsEmpty)
            return other.IsEmpty;
        else 
            return string.CompareOrdinal(this.Rtf, StartIndex, other.Rtf, other.StartIndex, Length) == 0;
    }

    public static bool operator ==(RtfToken first, RtfToken second)
    {
        return first.Equals(second);
    }

    public static bool operator !=(RtfToken first, RtfToken second)
    {
        return !first.Equals(second);
    }
    #endregion

    public override string ToString()
    {
        if (Rtf == null)
            return string.Empty;
        return Rtf.Substring(StartIndex, Length);
    }

    public override bool Equals(object obj)
    {
        if (obj is RtfToken)
            return Equals((RtfToken)obj);
        return false;
    }

    public override int GetHashCode()
    {
        if (Rtf == null)
            return 0;
        int code = Type.GetHashCode() ^ Length.GetHashCode();
        for (int i = StartIndex; i < StartIndex + Length; i++)
            code ^= Rtf[i].GetHashCode();
        return code;
    }
}

public class RtfTokenizer : IEnumerator<RtfToken> 
{
    public RtfTokenizer(string rtf)
    {
        if (rtf == null)
            throw new ArgumentNullException();
        Rtf = rtf;
    }

    public string Rtf { get; private set; }

#if false
    Rich Text Format (RTF) Specification, Version 1.9.1:
    Control Word
    An RTF control word is a specially formatted command used to mark characters for display on a monitor or characters destined for a printer. A control word’s name cannot be longer than 32 letters. 
    A control word is defined by:
    \<ASCII Letter Sequence><Delimiter>
    where <Delimiter> marks the end of the control word’s name. For example:
    \par
    A backslash begins each control word and the control word is case sensitive.
    The <ASCII Letter Sequence> is made up of ASCII alphabetical characters (a through z and A through Z). Control words (also known as keywords) originally did not contain any uppercase characters, however in recent years uppercase characters appear in some newer control words.
    The <Delimiter> can be one of the following:
    •   A space. This serves only to delimit a control word and is ignored in subsequent processing.
    •   A numeric digit or an ASCII minus sign (-), which indicates that a numeric parameter is associated with the control word. The subsequent digital sequence is then delimited by any character other than an ASCII digit (commonly another control word that begins with a backslash). The parameter can be a positive or negative decimal number. The range of the values for the number is nominally –32768 through 32767, i.e., a signed 16-bit integer. A small number of control words take values in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer). These control words include \binN, \revdttmN, \rsidN related control words and some picture properties like \bliptagN. Here N stands for the numeric parameter. An RTF parser must allow for up to 10 digits optionally preceded by a minus sign. If the delimiter is a space, it is discarded, that is, it’s not included in subsequent processing.
    •   Any character other than a letter or a digit. In this case, the delimiting character terminates the control word and is not part of the control word. Such as a backslash “\”, which means a new control word or a control symbol follows.
    If a single space delimits the control word, the space does not appear in the document (it’s ignored). Any characters following the single space delimiter, including any subsequent spaces, will appear as text or spaces in the document. For this reason, you should use spaces only where necessary. It is recommended to avoid spaces as a means of breaking up RTF syntax to make it easier to read. You can use paragraph marks (CR, LF, or CRLF) to break up lines without changing the meaning except in destinations that contain \binN. 
    In this document, a control word that takes a numeric parameter N is written with the N, as shown here for \binN, unless the control word appears with an explicit value. The only exceptions to this are “toggle” control words like \b (bold toggle), which have only two states. When such a control word has no parameter or has a nonzero parameter, the control word turns the property on. When such a control word has a parameter of 0, the control word turns the property off. For example, \b turns on bold and \b0 turns off bold. In the definitions of these toggle control words, the control word names are followed by an asterisk.
#endif

    public static bool IsAsciiLetter(char ch)
    {
        if (ch >= 'a' && ch <= 'z')
            return true;
        if (ch >= 'A' && ch <= 'Z')
            return true;
        return false;
    }

    public static bool IsAsciiDigit(char ch)
    {
        if (ch >= '0' && ch <= '9')
            return true;
        return false;
    }

    public static bool IsAsciiMinus(char ch)
    {
        return ch == '-';
    }

    public static bool IsControlWordNumericParameter(RtfToken token, int startIndex)
    {
        int inLength = token.Length - startIndex;
        int actualLength;
        if (IsControlWordNumericParameter(token.Rtf, token.StartIndex + startIndex, out actualLength)
            && actualLength == inLength)
        {
            return true;
        }
        return false;
    }

    static bool IsControlWordNumericParameter(string rtf, int startIndex, out int length)
    {
        int index = startIndex;
        if (index < rtf.Length - 1 && IsAsciiMinus(rtf[index]) && IsAsciiDigit(rtf[index + 1]))
            index++;
        for (; index < rtf.Length && IsAsciiDigit(rtf[index]); index++)
            ;
        length = index - startIndex;
        return length > 0;
    }

    static bool IsControlWord(string rtf, int startIndex, out int length)
    {
        int index = startIndex;
        for (; index < rtf.Length && IsAsciiLetter(rtf[index]); index++)
            ;
        length = index - startIndex;
        if (length == 0)
            return false;
        int paramLength;
        if (IsControlWordNumericParameter(rtf, index, out paramLength))
            length += paramLength;
        return true;
    }

    public IEnumerable<RtfToken> AsEnumerable()
    {
        int oldPos = nextPosition;
        RtfToken oldCurrent = current;
        try
        {
            while (MoveNext())
                yield return Current;
        }
        finally
        {
            nextPosition = oldPos;
            current = oldCurrent;
        }
    }

    string RebuildRtf()
    {
        string newRtf = AsEnumerable().Aggregate(new StringBuilder(), (sb, t) => sb.Append(t.ToString())).ToString();
        return newRtf;
    }

    [Conditional("DEBUG")]
    public void AssertValid()
    {
        var newRtf = RebuildRtf();
        if (Rtf != newRtf)
        {
            Debug.Assert(false, "rebuilt rtf mismatch");
        }
    }

    #region IEnumerator<RtfToken> Members

    int nextPosition = 0;
    RtfToken current = new RtfToken();

    public RtfToken Current
    {
        get {
            return current;
        }
    }

    #endregion

    #region IDisposable Members

    public void Dispose()
    {
    }

    #endregion

    #region IEnumerator Members

    object System.Collections.IEnumerator.Current
    {
        get { return Current; }
    }

    public void MoveTo(RtfToken token)
    {
        if (token.Rtf != Rtf)
            throw new ArgumentException();
        nextPosition = token.StartIndex + token.Length;
        current = token;
    }

    public bool MoveFixedLength(int length)
    {
        if (nextPosition >= Rtf.Length)
            return false;
        int actualLength = Math.Min(length, Rtf.Length - nextPosition);
        current = new RtfToken(RtfTokenType.Content, nextPosition, actualLength, Rtf);
        nextPosition += actualLength;
        return true;
    }

    static string crlf = "\r\n";

    static bool IsCRLF(string rtf, int startIndex)
    {
        return string.CompareOrdinal(crlf, 0, rtf, startIndex, crlf.Length) == 0;
    }

    public bool MoveNext()
    {
        // As previously mentioned, the backslash (\) and braces ({ }) have special meaning in RTF. To use these characters as text, precede them with a backslash, as in the control symbols \\, \{, and \}.
        if (nextPosition >= Rtf.Length)
            return false;
        RtfToken next = new RtfToken();

        if (Rtf[nextPosition] == '{')
        {
            next = new RtfToken(RtfTokenType.StartGroup, nextPosition, 1, Rtf);
        }
        else if (Rtf[nextPosition] == '}')
        {
            // End group
            next = new RtfToken(RtfTokenType.EndGroup, nextPosition, 1, Rtf);
        }
        else if (IsCRLF(Rtf, nextPosition))
        {
            if (current.Type == RtfTokenType.ControlWord)
                next = new RtfToken(RtfTokenType.IgnoredDelimiter, nextPosition, crlf.Length, Rtf);
            else
                next = new RtfToken(RtfTokenType.CRLF, nextPosition, crlf.Length, Rtf);
        }
        else if (Rtf[nextPosition] == ' ')
        {
            if (current.Type == RtfTokenType.ControlWord)
                next = new RtfToken(RtfTokenType.IgnoredDelimiter, nextPosition, 1, Rtf);
            else
                next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf);
        }
        else if (Rtf[nextPosition] == '\\')
        {
            if (nextPosition == Rtf.Length - 1)
                next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf); // Junk file?
            else
            {
                int length;
                if (IsControlWord(Rtf, nextPosition + 1, out length))
                {
                    next = new RtfToken(RtfTokenType.ControlWord, nextPosition, length + 1, Rtf);
                }
                else
                {
                    // Control symbol.
                    next = new RtfToken(RtfTokenType.ControlSymbol, nextPosition, 2, Rtf);
                }
            }
        }
        else
        {
            // Content
            next = new RtfToken(RtfTokenType.Content, nextPosition, 1, Rtf);
        }

        if (next.Length == 0)
            throw new Exception("internal error");
        current = next;
        nextPosition = next.StartIndex + next.Length;
        return true;
    }

    public void Reset()
    {
        nextPosition = 0;
        current = new RtfToken(); // Also forget the last token so MoveNext starts from a clean state.
    }

    #endregion
}

This fixed many false reports of differences between identical copy operations -- but some remained when copying multiple lines of lists or tables. For some reason, it seems Word simply doesn't generate the same RTF for long, complex formatting for seemingly identical copies.

You may need to investigate a different approach, for instance pasting the RTF into a RichTextBox and then comparing the resulting XAML.
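
A rough sketch of that idea follows, using a WPF FlowDocument directly rather than a RichTextBox control. It is Windows-only and needs the WPF assemblies (PresentationFramework and friends). RtfXamlComparer and its method names are hypothetical, and I have not verified that the resulting XAML is identical for the Word copies that still differ after normalization, so treat it as a starting point for experiments.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Windows;
using System.Windows.Documents;

public static class RtfXamlComparer
{
    // Loads the RTF into a FlowDocument and returns the document serialized as XAML bytes.
    // Assumption: the RTF string is 7-bit ASCII, as RTF normally is.
    public static byte[] RtfToXamlBytes(string rtf)
    {
        var document = new FlowDocument();
        var loadRange = new TextRange(document.ContentStart, document.ContentEnd);
        using (var input = new MemoryStream(Encoding.ASCII.GetBytes(rtf)))
        {
            loadRange.Load(input, DataFormats.Rtf);
        }

        // Take a fresh range so that it spans everything that was just loaded.
        var saveRange = new TextRange(document.ContentStart, document.ContentEnd);
        using (var output = new MemoryStream())
        {
            saveRange.Save(output, DataFormats.Xaml);
            return output.ToArray();
        }
    }

    // Two RTF strings are treated as equivalent when they convert to identical XAML.
    public static bool HasEquivalentRichText(string richText1, string richText2)
    {
        return RtfToXamlBytes(richText1).SequenceEqual(RtfToXamlBytes(richText2));
    }
}
```

The comparison is done on the raw bytes so that no assumption is needed about the text encoding of the saved XAML.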

solution1 3 ACCEPTED 2014-09-16 19:11:53
