
MS Word Find with regular expressions, repeated PATTERNS

I have a document with labeled (and some unlabeled!) paragraphs:
1.0 ...
...
2.4.3 ...
...
6.18.21.8 ...
Etc.

I need to find all those labels, and only those labels, regardless of the paragraph content and whatever other text may be present (e.g. unlabeled paragraphs). The expected label format is this:

  • New paragraph character, followed by
  • One or more number characters, followed by
  • A period, followed by
  • Some number of iterations of the preceding two steps, in order (number characters and a period), followed by
  • One or more number characters, followed by
  • Two spaces

Right now I have this expression, which may be close but isn't right because Word interprets the expression inside the first set of parentheses as me wanting to repeat the match rather than the pattern. (I need the latter.)

^13([0-9]@[\.])@[0-9]@(  )

Any tips on writing a regular expression that will yield the correct results, as described above?

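This matches the last five steps of your pattern. I'm not really sure what you mean by "new paragraph character", but if it is always the same character, just put it at the beginning of the expression. Note that the dot is escaped so that it matches only a literal period rather than any character.

([0-9]+\.)+[0-9]+(  )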
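
In standard regex engines (as opposed to Word's wildcard search) an unescaped dot matches any character, so it is worth checking the pattern with and without the escape. A small Python sketch, using made-up test strings and the hypothetical names loose and strict:

```python
import re

# Unescaped dot: "." matches any character, so "12x34  " is accepted too.
loose = re.compile(r"([0-9]+.)+[0-9]+(  )")
# Escaped dot: only a literal period may separate the numbers.
strict = re.compile(r"([0-9]+\.)+[0-9]+(  )")

for text in ["2.4.3  heading", "12x34  not a label"]:
    print(repr(text), bool(loose.search(text)), bool(strict.search(text)))
```

Both patterns accept the real label, but only the unescaped one also accepts the second string.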

If you are open to using VBA, here is a sub that replaces every match with whatever replacement string you set in it. Note that you will need to enable the regex library (Microsoft VBScript Regular Expressions, under Tools > References), which you can learn how to do here (it's for Excel but works the same in Word). Then add a module and paste the text below. I think the new paragraph character is \r (character code 13, matching the ^13 in your expression) rather than \n or \t, but I'm not 100% sure about that.

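Sub RemoveLabels()
    ' Requires a reference to "Microsoft VBScript Regular Expressions".
    Dim reg As New RegExp
    Dim labelPattern As String
    Dim replacement As String

    ' Text that replaces each match (empty string = delete the label).
    replacement = ""
    ' The dot is escaped so it matches a literal period only.
    labelPattern = "([0-9]+\.)+[0-9]+(  )"
    With reg
        .Global = True          ' replace every match, not just the first
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = labelPattern
    End With

    ' Caution: assigning Range.Text rewrites the whole document body,
    ' so existing formatting may be lost. Try it on a copy first.
    If reg.Test(ActiveDocument.Range.Text) Then
        ActiveDocument.Range.Text = reg.Replace(ActiveDocument.Range.Text, replacement)
    End If
End Sub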

Word doesn't seem to comply with its own regex documentation. To some degree, the Special drop-down in the Find and Replace dialog can help. In my case, it inserts {;} instead of the documented {,} for "Number of repetitions". (Once you know about the semicolon instead of the comma, you can of course type it yourself. This does, however, seem to differ between versions of Word.) Speaking of repetitions, Word has significant trouble handling them.

You can verify this by searching your example, extended with one small addition:

1.0  ...
...
2.4.3  ...
...
6.18.21.8  ...
...
...1.0  ...

with ^13([0-9]@.)@[0-9]@ . This should match the first three number-dot sequences, each at the start of its line, but not the fourth, where the line starts with other characters. However, in my version of Word, it matches only the very first one. This is consistent with ^13([0-9]{1;}.){1;}[0-9]{1;} matching only the first one, and ^13([0-9]{1;}.){2;}[0-9]{1;} not matching anything at all. (This also mirrors your observation that Word repeats the exact matched sequence rather than the pattern.)

You might want to check the transcription in RegEx 101 as a proof of concept.
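
As a rough stand-in for such a proof of concept, here is a minimal Python sketch. It assumes the document has been exported as plain text with one paragraph per line; the sample string and the name LABEL are illustrative and not taken from a real document.

```python
import re

# One or more "digits + period" groups, then digits, then two spaces,
# anchored at the start of a line (standing in for Word's ^13).
LABEL = re.compile(r"^((?:[0-9]+\.)+[0-9]+) {2}", re.MULTILINE)

sample = (
    "1.0  first paragraph\n"
    "unlabeled text\n"
    "2.4.3  second paragraph\n"
    "more text\n"
    "6.18.21.8  third paragraph\n"
    "trailing ...1.0  not a label\n"
)

labels = LABEL.findall(sample)
print(labels)  # ['1.0', '2.4.3', '6.18.21.8']
```

The last sample line contains 1.0 followed by two spaces, but not at the start of the line, so it is skipped.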

The closest possible to your requirements is probably either:

  • ^13[0-9.]{1;} (the tuned-up ^13[0-9.]{1;}.[0-9]{1;} again does not work at all), which unfortunately also accepts patterns you actually want excluded, or
  • running ^13[0-9]{1;}.[0-9]{1;} , ^13[0-9]{1;}.[0-9]{1;}.[0-9]{1;} , ^13[0-9]{1;}.[0-9]{1;}.[0-9]{1;}.[0-9]{1;} , etc., which lacks much of the regex beauty and flexibility but is much stricter.
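
The fixed-depth expressions in the second option follow a regular shape, so they can be generated rather than typed by hand. A small Python sketch, where the function name word_label_patterns and its parameters are made up for illustration; sep defaults to the semicolon reported above and can be set to a comma for Word versions that follow the documentation:

```python
def word_label_patterns(max_parts, sep=";"):
    """Build one Word wildcard pattern per label depth (2..max_parts numbers)."""
    number = "[0-9]{1" + sep + "}"  # "one or more digits" in Word's syntax
    return [
        "^13" + ".".join([number] * parts)
        for parts in range(2, max_parts + 1)
    ]

for pattern in word_label_patterns(4):
    print(pattern)
```

Each printed line can be pasted into the Find box in turn; the two trailing spaces from the question would still have to be appended by hand.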

Depending on your overall requirements, you might be better off using a different tool for that particular job.

BTW:

  • Word uses ? instead of . to denote any character. This is why the dot does not need to be escaped in the above expressions.
  • Word should actually accept dot or backslash for [\.], but requires [\\.] instead (in my version).
  • "Some number of iterations of the preceding two steps" is read (following your sample code) as meaning at least once.
  • The trailing blanks in the above regex are lost due to the handling of blanks in HTML.
  • If you are using Word's heading functionality (in particular the built-in heading styles): have you tried using the Outline view (perhaps with the body text hidden) for this purpose?
