Word VBA: iterating through characters incredibly slow

Question

I have a macro that changes a single quote in front of a number to an apostrophe (a closing single curly quote). Typically, when you type something like "the '80s" in Word, the apostrophe in front of the "8" faces the wrong way. The macro below works, but it is incredibly slow (about 10 seconds per page). In a regular language (even an interpreted one), this would be a fast procedure. Any insights into why it takes so long in VBA on Word 2007? Or, if someone has the find+replace skills to do this without iterating, please let me know.

Sub FixNumericalReverseQuotes()
    Dim char As Range
    Debug.Print "starting " + CStr(Now)
    With Selection
        total = .Characters.Count
        ' Will be looking ahead one character, so we need at least 2 in the selection
        If total < 2 Then
            Exit Sub ' Return is only valid after GoSub in VBA
        End If
        For x = 1 To total - 1
            a_code = Asc(.Characters(x))
            b_code = Asc(.Characters(x + 1))

            ' We want to convert a single quote in front of a number to an apostrophe
            ' Trying to use all numerical comparisons to speed this up
            If (a_code = 145 Or a_code = 39) And b_code >= 48 And b_code <= 57 Then
                .Characters(x) = Chr(146)
            End If 
        Next x
    End With
    Debug.Print "ending " + CStr(Now)
End Sub

Answer 1

Besides the two explicit questions (why is it slow, and how to do this without iterating), there is an implied one: how to iterate properly through a Word object collection. The answer is to use the object's .Next property rather than accessing items by index. That is, instead of:

For i = 1 to ActiveDocument.Characters.Count
    'Do something with ActiveDocument.Characters(i), e.g.:
    Debug.Print ActiveDocument.Characters(i).Text
Next

one should use:

Dim ch as Range: Set ch = ActiveDocument.Characters(1)
Do
    'Do something with ch, e.g.:
    Debug.Print ch.Text
    Set ch = ch.Next 'Note iterating
Loop Until ch is Nothing

Timing (index access vs. .Next): 00:03:30 vs. 00:00:06, that is, more than 3 minutes vs. 6 seconds.

I found this via Google and have since lost the link, sorry. My own tests confirm it.
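
Applied to the macro in the question, the .Next approach might look like the sketch below. It keeps the question's character codes (145, 39, 146) and its "look ahead one character" logic. The sub name and the limit variable are my own, and it assumes Track Changes is off, so replacing one character with another does not shift positions. I have not timed it.

```vba
Sub FixNumericalReverseQuotesNext()
    Dim ch As Range, nxt As Range
    Dim limit As Long

    ' Looking ahead one character, so we need at least 2 in the selection
    If Selection.Characters.Count < 2 Then Exit Sub

    limit = Selection.Range.End          ' do not look past the selection
    Set ch = Selection.Characters(1)

    Do
        Set nxt = ch.Next(wdCharacter, 1)
        If nxt Is Nothing Then Exit Do   ' end of document
        If nxt.End > limit Then Exit Do  ' end of selection

        ' Single quote (curly or straight) directly in front of a digit
        If (Asc(ch.Text) = 145 Or Asc(ch.Text) = 39) _
        And Asc(nxt.Text) >= 48 And Asc(nxt.Text) <= 57 Then
            ch.Text = Chr(146)
        End If

        Set ch = nxt                     ' step forward without re-indexing
    Loop
End Sub
```

Because each character is changed in place, only the quote characters are touched rather than the whole selection.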

Answer 2

This is a problem begging for regular expressions. Resolving the .Characters calls that many times is probably what is killing your performance.

I'd do something like this:

Public Sub FixNumericalReverseQuotesFast()

    Dim expression As RegExp
    Set expression = New RegExp

    Dim buffer As String
    buffer = Selection.Range.Text

    expression.Global = True
    expression.MultiLine = True
    expression.Pattern = "[" & Chr$(145) & Chr$(39) & "]\d"

    Dim matches As MatchCollection
    Set matches = expression.Execute(buffer)

    Dim found As Match
    For Each found In matches
        buffer = Replace(buffer, found, Chr$(146) & Right$(found, 1))
    Next

    Selection.Range.Text = buffer

End Sub

NOTE: Requires a reference to Microsoft VBScript Regular Expressions 5.5 (or late binding).
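
As a variation on the same idea, the match loop can probably be collapsed into a single RegExp.Replace call by capturing the digit and reusing it as $1. The sketch below uses late binding, so no reference is needed. The sub name is made up, and like the version above it overwrites the selection's text as a whole.

```vba
Public Sub FixNumericalReverseQuotesOneCall()

    ' Late binding: no reference to the Regular Expressions library required
    Dim expression As Object
    Set expression = CreateObject("VBScript.RegExp")

    expression.Global = True
    ' Same character class as above, with the digit captured as group 1
    expression.Pattern = "[" & Chr$(145) & Chr$(39) & "](\d)"

    ' Replace every quote+digit pair with apostrophe+digit in one pass
    Selection.Range.Text = expression.Replace(Selection.Range.Text, Chr$(146) & "$1")

End Sub
```

I have not benchmarked this against the loop version.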

EDIT: If you don't want to use the Regular Expressions library, the key is still to avoid working with Ranges. The same idea can easily be converted to work on a byte array instead:

Sub FixNumericalReverseQuotes()
    Dim chars() As Byte
    chars = StrConv(Selection.Text, vbFromUnicode)

    Dim pos As Long
    For pos = 0 To UBound(chars) - 1
        If (chars(pos) = 145 Or chars(pos) = 39) _
        And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
           chars(pos) = 146
        End If
    Next pos

    Selection.Text = StrConv(chars, vbUnicode)
End Sub

Benchmarks (100 iterations, 3 pages of text with 100 "hits" per page):

Regex method: 1.4375 seconds
Array method: 2.765625 seconds
OP method: (Ended task after 23 minutes)

The array method is about half as fast as the regex method, but still takes roughly 10 ms per page.

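EDIT 2: Apparently the methods above are not format safe (they overwrite the selection's text as a whole), so here is method 3, which uses Find and changes only the matched character: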

Sub FixNumericalReverseQuotesVThree()

    Dim full_text As Range
    Dim cached As Long

    Set full_text = ActiveDocument.Range
    full_text.Find.ClearFormatting
    full_text.Find.MatchWildcards = True
    cached = full_text.End

    Do While full_text.Find.Execute("[" & Chr$(145) & Chr$(39) & "][0-9]")
        full_text.End = full_text.Start + 2
        full_text.Characters(1) = Chr$(146) ' closing single quote, as in the methods above
        full_text.Start = full_text.Start + 1
        full_text.End = cached
    Loop

End Sub

Again, this is slower than both of the methods above, but it still runs reasonably fast (on the order of milliseconds).
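
Since the question also asked for a find+replace that does not iterate at all, here is a sketch of a single wildcard "replace all". It uses the same search pattern as method 3, with the digit in a group so it can be put back as \1. The sub name is made up, and I have not checked how formatting is carried over when the quote and the digit are formatted differently.

```vba
Sub FixNumericalReverseQuotesReplaceAll()

    With ActiveDocument.Range.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        ' Quote (curly or straight) followed by a digit; the digit is group 1
        .Text = "[" & Chr$(145) & Chr$(39) & "]([0-9])"
        ' Closing single quote followed by the captured digit
        .Replacement.Text = Chr$(146) & "\1"
        .Execute Replace:=wdReplaceAll
    End With

End Sub
```

This works on the whole document, like method 3. To limit it to the selection, Selection.Range.Find should work the same way.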

Answer 3

Modified version of @Comintern's "Array method":

Sub FixNumericalReverseQuotes()
    Dim chars() As Byte
    chars = StrConv(Selection.Text, vbFromUnicode)

    Dim pos As Long
    For pos = 0 To UBound(chars) - 1
        If (chars(pos) = 145 Or chars(pos) = 39) _
        And (chars(pos + 1) >= 48 And chars(pos + 1) <= 57) Then
           ' Make the change directly in the selection so track changes is sensible.
           ' I have to use 213 instead of 146 for reasons I don't understand--
           ' probably has to do with encoding on Mac, but anyway, this shows the change.
           Selection.Characters(pos + 1) = Chr(213)
        End If
    Next pos
End Sub
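
A likely explanation for the 213: Chr() uses the system code page, and in Mac OS Roman the closing single quote is 213 (0xD5), while in Windows-1252 it is 146. The sketch below sidesteps code pages by using Unicode code points (8216 for the opening quote, 8217 for the closing one). It assumes AscW and ChrW behave on Mac as they do on Windows, which I have not verified. Like the macro above, it assumes positions in Selection.Text line up with Selection.Characters. The sub name is made up.

```vba
Sub FixNumericalReverseQuotesPortable()
    Dim buffer As String
    Dim pos As Long

    buffer = Selection.Text

    For pos = 1 To Len(buffer) - 1
        Select Case AscW(Mid$(buffer, pos, 1))
            Case 39, 8216   ' straight quote, opening single curly quote
                If Mid$(buffer, pos + 1, 1) Like "#" Then
                    ' Mid$ and Characters are both 1-based, so no offset is needed.
                    ' Change the character in the selection so track changes is sensible.
                    Selection.Characters(pos).Text = ChrW(8217)
                End If
        End Select
    Next pos
End Sub
```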

Answer 4

Maybe this?

Sub FixNumQuotes()
    Dim MyArr As Variant, MyString As String, X As Long, Z As Long
    Debug.Print "starting " + CStr(Now)
    For Z = 145 To 146
        MyArr = Split(Selection.Text, Chr(Z))
        ' Element 0 is the text before the first quote, so start at the second element
        For X = LBound(MyArr) + 1 To UBound(MyArr)
            If IsNumeric(Left(MyArr(X), 1)) Then MyArr(X) = "'" & MyArr(X)
        Next
        MyString = Join(MyArr, Chr(Z))
        Selection.Text = MyString
    Next
    Selection.Text = Replace(Replace(Selection.Text, Chr(146) & "'", "'"), Chr(145) & "'", "'")
    Debug.Print "ending " + CStr(Now)
End Sub

I am not 100% sure of your criteria. I have turned both the open and the close single quote into a ', but you can change that quite easily if you want.

It splits the string into an array on Chr(145), checks the first character of each element for a numeric, and prefixes the element with a single quote if one is found.

Then it joins the array back into a string on Chr(145) and repeats the whole thing for Chr(146). Finally, it looks through the string for any occurrence of a single quote and either of those curly quotes next to each other (because that has to be something we just created) and replaces the pair with just the single quote we want. This leaves any occurrence not next to a number intact.

This final replacement part is the bit you would change if you want something other than ' as the character.

Answer 5

I have been struggling with this for days now. My attempted solution was to use a regular expression on document.text. Then, using the matches in a document.range(start,end), replace the text. This preserves formatting.

The problem is that the start and end in the range do not match the index into text. I think I have found the discrepancy - hidden in the range are field codes (in my case they were hyperlinks). In addition, document.text has a bunch of BEL codes that are easy to strip out. If you loop through a range using the character method, append the characters to a string and print it you will see the field codes that don't show up if you use the .text method.

Amazingly you can get the field codes in document.text if you turn on "show field codes" in one of a number of ways. Unfortunately, that version is not exactly the same as what the range/characters shows - the document.text has just the field code, the range/characters has the field code and the field value. Therefore you can never get the character indices to match.

I have a working version where instead of using range(start,end), I do something like:

Set matchRange = doc.Range.Characters(myMatches(j).FirstIndex + 1)           
matchRange.Collapse (wdCollapseStart)
Call matchRange.MoveEnd(WdUnits.wdCharacter, myMatches(j).Length)
matchRange.text = Replacement

As I say, this works but the first statement is dreadfully slow - it appears that Word is iterating through all of the characters to get to the correct point. In doing so, it doesn't seem to count the field codes, so we get to the correct point.

Bottom line, I have not been able to come up with a good way to match the indexing of the document.text string to an equivalent range(start,end) that is not a performance disaster.

Ideas welcome, and thanks.

Word VBA: iterating through characters incredibly slow

Question

5 answers

solution1
3 ACCEPTED 2015-04-25 13:00:32

solution2
0 2015-04-10 04:23:34

solution3
0 2015-04-15 02:01:08

solution4
0 2015-04-15 06:05:37

solution5
0 2015-04-17 16:05:10
