简体   繁体   中英

VBA and RegEx matching arbitrary strings in Excel 2010

I need to extract adress and potentially zip code as separate entites from the same line. The address line may or may not contain a zip code, and may or may not contain other unwanted strings. This is due to a bug in a web form, which is fixed, but the damage is already done to a set of elements.

Possible forms and results:

  • Address: Some address 251, 99302 Something Telephone: 555 6798 8473 -- Return "some address 251" and "99302 something" in separate strings. Comma may or may not be trailed by whitespace.
  • Address: Some address 251 -- Return "some address 251"
  • Address: Some address 251, 99302 -- Return "some address 251" and "99302". Again, comma may or may not be trailed by whitespace.

I have a basic understanding of how this could be done programatically in VBA by iterating over the string and checking individual characters and substrings, but I feel like it will be time-consuming and not very robust afterwards. Or if it's robust, it would end up being huge because of all the possible variations.

I am struggling the most with how to form the regular expression(s) and possibly the conditionals to get the desired results.

This is part of a larger project, so I won't paste all the various code, but I am pulling mailitems from Outlook to analyze and dump relevant info into an Excel sheet. I have both the Outlook and Excel code working, but the logic that extracts information is a bit flawed.

Here are the new snippets I've been working on:

Function regexp(str As String, regP As String)

Dim rExp As Object, rMatch As Object

Set rExp = CreateObject("vbscript.regexp")
With rExp
    .Global = False
    .MultiLine = False
    .IgnoreCase = True
    .Pattern = regP
End With

Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
    regexp = rMatch(0)
Else
    RegEx = vbNullString
    Debug.Print "No match found!"
End If

End Function


Sub regexpAddress(str As String)
Dim result As String
Dim pattern As String

If InStr(str, "Telephone:") Then pattern = "/.+?(?=Telephone:)/"
result = regexp(str, pattern)

End Sub

I'm not sure how to form the regexps here. The one outlined should pull the right information (in 1 string instead 2, but that's still an improvement) - but only when the line contains the string "Telephone:", and I have a lot of cases where it won't contain that.

This is the current and somewhat flawed logic, which for some reason doesn't always yield the results I want:

For Each objMail In olFolder.Items

name = ""
address = ""
telephone = ""
email = ""

vIterations = vIterations + 1

arrBody = Split(objMail.body, Chr(10)) ' Split mail body when linebreak is encountered, throwing each line into its own array position
For i = 0 To UBound(arrBody)
    arrLine = Split(arrBody(i), ": ") ' For each element (line), make new array, and if text search matches then write the 2nd half of the element to variable
    If InStr(arrBody(i), "Name:") > 0 Then ' L2
        name = arrLine(1) ' Reference 2nd column in array after the split
    ElseIf InStr(arrBody(i), "Address:") > 0 Then
        address = arrLine(1)
    ElseIf InStr(arrBody(i), "Telephone:") > 0 Then
        telephone = CLng(arrLine(1))
    ElseIf InStr(arrBody(i), "Email:") > 0 Then
        email = arrLine(1)
    End If ' L2
Next
Next ' Next/end-for

This logic accepts and formats input of the following type:

Name: Joe
Address: Road
Telephone: 55555555555555
Email: joe@road.com

and returns joe, road, 55555 and joe@road.com to some defined Excel cells. This works fine when the mailitems are ordered as expected.

Problem: A bug lead to not my webform not inserting a linebreak after the address in some cases. The script still worked for the most part, but the mailitem contents sometimes ended up looking like this:

Name: Joe
Address: Road Telephone: 55555555555555
Email: joe@road.com

The address field was contaminated when it reached Excel ("Road Telephone" instead of just "Road"), but there was no loss of information. Which was acceptable, as it's easy to remove the surpluss string.

But in the following case (no email is entered), the phone number is not only lost but is actually replaced by a phone number from some other, arbitrary mailitem and I can't FOR THE LIFE OF ME figure out (1) why it won't get the correct number, (2) why it jumps to a new mail item to find the phone number or (3) how it selects this other mailitem:

Name: Joe
Address: Road Telephone: 5555555555555
Email: 

In Excel:

Name: Joe
Address: Road Telephone
Telephone: 8877445511
Email: 

So, TL;DR: my selection logic is flawed, and being that it is so hastily hacked together, not to mention how it yields false information and I am unable to figure out how and why, I would like to do a better operation using some other solution (like regexp?) instead for a more robust code.

I don't know if it was dumb luck or if I actually managed to learn some regex, but these patterns turn out to do exactly what I need.

' regex patterns - use flag /i
adrPattern = "([a-z ]{2,}\s{0,1}\d{0,3})" ' Select from a-z or space, case insensitive and at least 2 characters long, followed by optional space, ending with 0-3 digits
adrZipcode = "\b(\d{4})\b" ' Exactly 4 digits surrounded on both sides by either space, text or non-word character like comma

Edit: "Fixed" the telephone problem too. After spending 2 hours trying to write it in regex, and failing miserably, it dawned on me that solving the problem as a matter of faulty creation of the array had to be so much easier than treating it as a computational problem. And it was:

mailHolder = Replace(objMail.body, "Telephone:", Chr(10) + "Telephone:")
arrBody = Split(mailHolder, Chr(10))

Not so long ago I had a similar problem. Code may not be very professional, but it can be helpful :) Could you check if this code work for you correctly?

Function regexp(str As String, regP As String)

Dim rExp As Object, rMatch As Object

Set rExp = CreateObject("vbscript.regexp")
With rExp
    .Global = False
    .MultiLine = False
    .IgnoreCase = True
    .pattern = regP
End With

Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
    regexp = rMatch(0)
Else
    RegEx = vbNullString
    Debug.Print "No match found!"
End If

End Function

Function for_vsoraas()

For Each objMail In olFolder.Items

vIterations = vIterations + 1

objMail_ = Replace(objMail.body, Chr(10), " ")    
Dim StringToSearch(3) As String
StringToSearch(0) = "Name:"
StringToSearch(1) = "Address:"
StringToSearch(2) = "Telephone:"
StringToSearch(3) = "Email:"

Dim ArrResults(4) As String 'name,address,telephone,email, zipcode

For i = 0 To UBound(StringToSearch)
    ResultString = ""
    StartString = InStr(objMail_, StringToSearch(i))
    If StartString > 0 Then
        If i = UBound(StringToSearch) Then 'last string to search, dont search EndString
        ResultString = Right(objMail_, Len(objMail_) + Len(StringToSearch(i)))
        Else
        EndString = 0
        j = i
        While (EndString = 0) 'prevent case no existing EndString
        EndString = InStr(objMail_, StringToSearch(j + 1))
        j = j + 1
            If j = UBound(StringToSearch) And EndString = 0 Then
            EndString = Len(objMail_) + 1
            End If
        Wend
        ResultString = Mid(objMail_, StartString + Len(StringToSearch(i)) + 1, EndString - 1 - StartString - Len(StringToSearch(i)))

        End If
    ArrResults(i) = ResultString
    End If
Next i

'search zipcode and address
ArrResults(4) = regexp(ArrResults(1), "\b(\d{5})\b")
ArrResults(1) = regexp(ArrResults(1), "([a-z ]{2,}\s{0,1}\d{0,3})")

'your varabile
Name = ArrResults(0)
Address = ArrResults(1)
Telephone = ArrResults(2)
Email = ArrResults(3)
ZipCode = ArrResults(4)

Next ' Next/end-for
End Function

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM