
Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on a separate line, and each paragraph starts with a number.

For example

-
1. This is a paragraph
It may go over multiple lines
-

Ideally, I would like to exclude the '-', but it doesn't really matter, as I will be placing the result in a string and running another regex against it (one that I know works).

The code I am trying to use is basically as follows

Dim matchPara As String
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String

matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True

fileName = "C:\file.txt"
fileNo = FreeFile

Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)

For Each theMatch in matches
    MsgBox(theMatch.Value)
Next theMatch

Close #fileNo

I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping

-?\d.*?-

However, when I run the code, theMatch.Value only ever contains a single '-'. After some experimenting with the regex, I got it to display the first line of text, but never any more than the first line.

I have checked the length of theMatch.Value with:

MsgBox(len(theMatch.Value))

and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories proved wrong.

I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.

The paragraphs contain data that I am trying to extract. The idea was to pull each paragraph out with a regex, place it in a string, and then run other regexes against it to get the information that I need. Some paragraphs won't contain that data, so the plan was to loop through each paragraph individually and handle errors more gracefully when the data is missing (i.e. get what I can and drop the rest with an error message).

Here is a screenshot:

[regex101 screenshot]

This simple approach does not use regex. It assumes the data is in column A and places the paragraphs in column B:

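Sub paragraph_no_regex()
    Dim s As String
    Dim ary As Variant
    Dim a As Variant
    Dim i As Long

    ' Join all constant (non-formula) cells in column A into one string.
    ' SpecialCells(2) is xlCellTypeConstants.
    With Application.WorksheetFunction
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With

    ' Split on "-" and write each piece to column B, one per row.
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub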
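
Splitting on every "-" also cuts a paragraph wherever its own text contains a hyphen. As a hedged alternative, here is a minimal sketch that works on the file text directly and treats only lines consisting of a lone "-" as delimiters. The function name SplitParagraphs is hypothetical, and the sketch assumes the whole file has already been read into a string:

```vba
' Hypothetical helper: returns the paragraphs found between lines
' that contain nothing but "-". No regex is involved.
Function SplitParagraphs(ByVal source As String) As Collection
    Dim result As New Collection
    Dim lines() As String
    Dim current As String
    Dim i As Long

    ' Normalise Windows (CRLF) and bare CR line endings to LF.
    source = Replace(source, vbCrLf, vbLf)
    source = Replace(source, vbCr, vbLf)
    lines = Split(source, vbLf)

    For i = LBound(lines) To UBound(lines)
        If Trim$(lines(i)) = "-" Then
            ' A delimiter line closes the paragraph collected so far.
            If Len(current) > 0 Then result.Add current
            current = ""
        ElseIf Len(current) > 0 Then
            ' Continuation line of the current paragraph.
            current = current & vbLf & lines(i)
        ElseIf Len(Trim$(lines(i))) > 0 Then
            ' First non-blank line starts a new paragraph.
            current = lines(i)
        End If
    Next i

    ' Keep trailing text that was not closed by a delimiter.
    If Len(current) > 0 Then result.Add current
    Set SplitParagraphs = result
End Function
```

Each item of the returned Collection can then be passed to whatever second-stage regex extracts the data.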


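Sub F()

    ' Early binding: requires a reference to
    ' "Microsoft VBScript Regular Expressions 5.5".
    Dim re As New RegExp
    Dim sMatch As String
    Dim document As String

    ' "." does not match a newline here, so "(.|\n)" is used to span lines.
    re.Pattern = "-\n((.|\n)+?)\n-"

    'Getting document
    document = ...

    ' First match, first capture group: the text between the dashes.
    sMatch = re.Execute(document)(0).SubMatches(0)

End Sub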

If you need the dashes, just include them in the capture group (the outer parentheses).
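
The Sub above reads only the first match. To collect every paragraph, the Global property can be switched on and the match collection looped over. The following is a minimal sketch of that idea; the function name ExtractAllParagraphs is hypothetical, late binding is used so no library reference is needed, and it assumes that the text uses LF line endings and that every paragraph has its own opening and closing dash line, as in the question:

```vba
' Hypothetical helper: returns the text between each pair of "-" lines.
Function ExtractAllParagraphs(ByVal source As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim result As New Collection

    ' Late binding, so no reference to the RegExp library is required.
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "-\n((.|\n)+?)\n-"
    re.Global = True    ' report all matches, not just the first

    For Each m In re.Execute(source)
        ' SubMatches(0) is the outer capture group (the paragraph text).
        result.Add m.SubMatches(0)
    Next m

    Set ExtractAllParagraphs = result
End Function
```

Because each match consumes its closing dash, two paragraphs that share a single dash line between them would not both be found by this sketch.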

This regex matches your description and successfully extracts the paragraphs (as tested on regex101.com):

matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"

It needs the 'global' flag but not the 'multiline' flag, because the end-of-line is matched explicitly with "\n" in the regex. The main point is that the innermost group matches any character, including the end-of-line (given as an alternative, since "." on its own does not match a newline), but does so in a non-greedy way ("+?"). Word boundaries are not needed here. Also, "-" is not a special character where it is used in this regex, so it doesn't have to be escaped.

As an added benefit, leading and trailing whitespace is trimmed ("\s*" outside the group).
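
One caveat when moving from regex101 to a real file: text read from a Windows file usually has CRLF line endings, and "-\n" will not match "-" followed by CR LF. The sketch below, under that assumption, converts CRLF to LF before applying the pattern. The function name NumberedParagraphs is hypothetical:

```vba
' Hypothetical helper: applies the pattern above after normalising line endings.
Function NumberedParagraphs(ByVal source As String) As Collection
    Dim re As Object
    Dim m As Object
    Dim result As New Collection

    ' Assumption: the input may use CRLF; the pattern expects plain LF.
    source = Replace(source, vbCrLf, vbLf)

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
    re.Global = True    ' find every paragraph; Multiline is not needed

    For Each m In re.Execute(source)
        ' The only capturing group holds the paragraph text without
        ' the leading number and surrounding whitespace.
        result.Add m.SubMatches(0)
    Next m

    Set NumberedParagraphs = result
End Function
```

With the question's code, this would be called as NumberedParagraphs(document) after the Input$ line, and each returned item could then be fed to the second regex.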
