简体   繁体   中英

VBA regex: extract multiple strings between strings within Excel cell with custom function

Within an Excel column I have data such as:

"Audi (ADI), Mercedes (modelx) (MEX), Ferrari super fast, high PS (FEH)"

There hundreds of models that are described by a name and an abbreviation of three capitalized letters in brackets.

I need to extract the names only and the abbreviations only to separate cells. I succeeded doing this for the abbreviations by the following module:

Function extrABR(cellRef) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As Variant
    Const sPat As String = "([A-Z][A-Z][A-Z][A-Z]?)"  ' this is my regex to match my string
    
    
Set RE = CreateObject("vbscript.regexp")
With RE
    .Global = True
    .MultiLine = True
    .Pattern = sPat
    If .Test(cellRef) Then
        Set MC = .Execute(cellRef)
        For Each M In MC
            sTemp = sTemp & ", " & M.SubMatches(0)
        Next M
    End If
End With

extrABR = Mid(sTemp, 3)

End Function 

However, I do not manage to do so for names . I thought of just exchanging the regex by the following regex: (^(.*?)(?= \\([AZ][AZ][AZ])|(?<=, )(.*)(?= \\([AZ][AZ][AZ])) , but VBA does not seem to allow lookbehind.

Any idea?

Correct, lookbehinds are not supported, but they are only necessary when your expected matches overlap. It is not the case here, all your matches are non-overlapping. So, you can again rely on capturing :

(?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))

See the regex demo . Group 1 values are accessed via .Submatches(0) .

Details :

  • (?:^|,) - either start of a string or a comma
  • \\s* - zero or more whitespace chars
  • (.*?) - Capturing group 1: any zero or more chars other than line break chars as few as possible
  • (?=\\s*\\([AZ]{3,}\\)) - a positive lookahead that matches a location that is immediately followed with
    • \\s* - zero or more whitespace chars
    • \\( - a ( char
    • [AZ]{3,} - three or more uppercase chars
    • \\) - a ) char.

Demo screenshot:

在此处输入图片说明

RE.REPLACE -- Try this function.. anything between the parenthesis will be replaced with "" giving you string of model names only, which you can then split on comma and get string array if so desired.

Function ModelNames(cellRef) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As Variant, sPat As String
    sPat = "\([^)]+\)"
'Or you can use your formula pattern "([A-Z][A-Z][A-Z][A-Z]?)" to get (modelx)  in the final output.

Set RE = CreateObject("vbscript.regexp")
With RE
    .Global = True
    .MultiLine = True
    .Pattern = sPat
End With

ModelNames = RE.Replace(cellRef, "")

End Function

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM