
Regex lookahead to match everything prior to 1st OR 2nd group of digits

Regex in VBA.

I am using the following regex to match the second occurrence of a 4-digit group, or the first group if there is only one group:

\b\d{4}\b(?!.+\b\d{4}\b)
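As a quick check of what this pattern picks out, here is a minimal VBA sketch. The helper name LastFourDigitGroup and the sample strings are made up for illustration, and it assumes late binding to VBScript.RegExp. The comments show what I would expect from tracing the pattern by hand. Note that the negative lookahead rejects any group that is followed by another standalone group, so with three groups it returns the last one rather than the second.

```vba
' Hypothetical helper (not from the original post): returns the first match
' of the question's pattern, or "" when nothing matches.
Function LastFourDigitGroup(ByVal s As String) As String
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = "\b\d{4}\b(?!.+\b\d{4}\b)"
    Set m = re.Execute(s)
    If m.Count > 0 Then LastFourDigitGroup = m(0).Value
End Function

Sub DemoLastFourDigitGroup()
    Debug.Print LastFourDigitGroup("abc 1111 def 2222 ghi")   ' 2222
    Debug.Print LastFourDigitGroup("abc 1111 def")            ' 1111
    Debug.Print LastFourDigitGroup("1111 a 2222 b 3333")      ' 3333
    ' 4444 is glued to a word, so \b does not match in front of it.
    Debug.Print LastFourDigitGroup("String4444 8888 String")  ' 8888
End Sub
```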

Now I need to do roughly the opposite: match everything up to the second occurrence of a 4-digit group, or up to the first group if there is only one. If there are no 4-digit groups, capture the entire string.

This would be sufficient.

But there is also a preferable "bonus" route: ideally, the pattern would match everything up to a 4-digit group that is optionally followed by some arbitrary text, but only if no other 4-digit group follows it. If there is a second group of 4 digits, capture everything up to that group (including the first group and periods, but not commas). If there are no groups, capture everything. If the line starts with a 4-digit group, capture nothing.

I understand that this, too, could (should?) be done with a lookahead, but I have not been able to figure out how lookaheads work for this purpose.

Examples:

Input: String.String String 4444  
Capture: String.String String 4444

Input: String4444 8888 String  
Capture: String4444

Input: String String 444 . B, 8888
Capture: String String 444 . B

Bonus case:

Input: 8888 String  
Capture:   

To match everything up to the second occurrence of a 4-digit group, or up to the first group if there is only one, use this pattern:

^((?:.*?\d{4})?.*?)(?=\s*\b\d{4}\b)



Per the comment below, use this pattern:

^((?:.*?\d{4})?.*?(?=\s*\b\d{4}\b)|.*)
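To see how this pattern behaves on the question's inputs, here is a minimal VBA sketch. The helper name UpToSecondGroup is hypothetical, and the expected results in the comments come from tracing the pattern by hand rather than from the original post. The result is wrapped in brackets so that empty matches stay visible.

```vba
' Hypothetical helper: first match of the pattern above, in brackets.
Function UpToSecondGroup(ByVal s As String) As String
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = "^((?:.*?\d{4})?.*?(?=\s*\b\d{4}\b)|.*)"
    Set m = re.Execute(s)
    If m.Count > 0 Then UpToSecondGroup = "[" & m(0).Value & "]"
End Function

Sub DemoUpToSecondGroup()
    Debug.Print UpToSecondGroup("String4444 8888 String")       ' [String4444]
    Debug.Print UpToSecondGroup("String.String String 4444")    ' [String.String String]
    Debug.Print UpToSecondGroup("String String 444 . B, 8888")  ' [String String 444 . B,]
    Debug.Print UpToSecondGroup("8888 String")                  ' []
    Debug.Print UpToSecondGroup("no digits here")               ' [no digits here]
End Sub
```

On these inputs a lone trailing group is left out of the match and the comma is kept, so the pattern appears to cover the basic case rather than the bonus one.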


This matches everything except spaces up to the last occurrence of a 4-digit word.

You can use the following:

(?:(?! ).)+(?=.*\b\d{4}\b)
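Because this pattern cannot cross a space, a single match is one run of non-space characters, so it is most useful with Global = True. The following VBA sketch (the helper name NonSpaceRunsBeforeGroup and the "|" separator are assumptions for illustration) joins all matches so the individual runs are visible.

```vba
' Hypothetical helper: collects every match of the pattern, joined with "|".
Function NonSpaceRunsBeforeGroup(ByVal s As String) As String
    Dim re As Object, m As Object, parts As String
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "(?:(?! ).)+(?=.*\b\d{4}\b)"
    For Each m In re.Execute(s)
        If Len(parts) > 0 Then parts = parts & "|"
        parts = parts & m.Value
    Next m
    NonSpaceRunsBeforeGroup = parts
End Function

Sub DemoNonSpaceRuns()
    Debug.Print NonSpaceRunsBeforeGroup("ab cd 1234 ef")           ' ab|cd
    Debug.Print NonSpaceRunsBeforeGroup("String4444 8888 String")  ' String4444
End Sub
```

If a single string with the spaces preserved is needed, the pieces would have to be rejoined in code, which changes the original spacing only when runs were separated by more than one space.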


For your basic case (marked by you as sufficient), this will work:

((?:(?!\d{4}).)*(?:\d{4})?(?:(?!\d{4}).)*)(?=\d{4})

You can pad every \d{4} internally with \b if you need to.
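A minimal VBA sketch for trying this pattern out, assuming late binding to VBScript.RegExp; the helper name BasicCaseMatch is hypothetical and the expected results are hand-traced. Two things stand out on these inputs: the match keeps the whitespace in front of the closing group, and a string with no 4-digit group produces no match at all, because the final lookahead always requires one.

```vba
' Hypothetical helper: first match in brackets, or "<no match>".
Function BasicCaseMatch(ByVal s As String) As String
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = "((?:(?!\d{4}).)*(?:\d{4})?(?:(?!\d{4}).)*)(?=\d{4})"
    Set m = re.Execute(s)
    If m.Count > 0 Then
        BasicCaseMatch = "[" & m(0).Value & "]"
    Else
        BasicCaseMatch = "<no match>"
    End If
End Function

Sub DemoBasicCaseMatch()
    Debug.Print BasicCaseMatch("String4444 8888 String")     ' [String4444 ]
    Debug.Print BasicCaseMatch("String.String String 4444")  ' [String.String String ]
    Debug.Print BasicCaseMatch("8888 String")                ' []
    Debug.Print BasicCaseMatch("no digits here")             ' <no match>
End Sub
```

Calling RTrim on the result is one way to drop the trailing space if it is not wanted.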


You can use this regex in VBA to capture lines with 4-digit numbers, or those that do not have 4-digit numbers in them:

^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)

See demo; it should work the same in VBA.

The regex consists of 2 alternatives: (?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) and (?!.*[0-9]{4}).* .

(?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) optionally matches a prefix that runs through the first 4-digit number, then matches 0 or more characters (as few as possible) up to a position that is followed by optional whitespace and a 4-digit number.

(?!.*[0-9]{4}).* matches the whole line, but only if the line contains no 4-digit number.

Note that to match only whole numbers (not digits that are part of other words) you need to add \b around the [0-9]{4} patterns (i.e. \b[0-9]{4}\b ).
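The effect of adding \b can be checked with a short VBA sketch. The helper name FirstMatchOf and the sample input "String12345 x" are assumptions for illustration; the expected results are hand-traced. Without \b, the first four digits of a longer run count as a group; with \b, that input contains no standalone 4-digit group, so the second alternative returns the whole string.

```vba
' Hypothetical helper: first match of pat in s, or "" when nothing matches.
Function FirstMatchOf(ByVal pat As String, ByVal s As String) As String
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = pat
    Set m = re.Execute(s)
    If m.Count > 0 Then FirstMatchOf = m(0).Value
End Function

Sub DemoWordBoundaries()
    Const PLAIN As String = "^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)"
    Const BOUNDED As String = "^((?:.*?\b[0-9]{4}\b)?.*?(?=\s*?\b[0-9]{4}\b)|(?!.*\b[0-9]{4}\b).*)"

    Debug.Print FirstMatchOf(PLAIN, "String12345 x")             ' String
    Debug.Print FirstMatchOf(BOUNDED, "String12345 x")           ' String12345 x
    Debug.Print FirstMatchOf(PLAIN, "String4444 8888 String")    ' String4444
    Debug.Print FirstMatchOf(BOUNDED, "String4444 8888 String")  ' String4444
End Sub
```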

If anyone is interested, I cheated to fully solve my problem.

Building on this answer, which solves the vast majority of my data set, I used program logic to catch some rarely seen use-cases. It seemed difficult to get a single regex to cover all the situations, so this seems like a viable alternative.

The problem is illustrated here.

The code isn't bulletproof yet, but this is the gist:

Function cRegEx(ByVal str As String) As String
    ' ByVal: str is shortened below, and the caller's variable must not change.
    Dim rExp As Object, rMatch As Object, regP As String, strL() As String
    regP = "^((?:.*?[0-9]{4})?.*?(?:(?=\s*[0-9]{4})|(?:(?!\d{4}).)*)|(?!.*[0-9]{4}).*)"

    ' Encountered two use-cases that weren't easily solvable with regex, due to the already complex pattern(s).
    ' Split str if we encounter a comma and only keep the first part - this way we don't have to solve this case in the regex.
    If InStr(str, ",") <> 0 Then
        strL = Split(str, ",")
        str = strL(0)
    End If

    ' If str starts with a 4-digit group, return an empty string.
    If Not cRegExNum(str) Then
        Set rExp = CreateObject("vbscript.regexp")
        With rExp
            .Global = False
            .MultiLine = False
            .IgnoreCase = True
            .Pattern = regP
        End With

        Set rMatch = rExp.Execute(str)
        If rMatch.Count > 0 Then
            cRegEx = rMatch(0).Value
        Else
            cRegEx = ""
        End If
    Else
        cRegEx = ""
    End If
End Function


Function cRegExNum(ByVal str As String) As Boolean
    ' Does the string start with four consecutive digits?
    ' Return True if it does.
    Dim rExp As Object, regP As String
    regP = "^\d{4}"

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = False
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = regP
    End With

    ' Test returns True when the pattern matches anywhere it is allowed to (here: only at the start).
    cRegExNum = rExp.Test(str)
End Function
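To illustrate why the two pieces of program logic are there, the following self-contained VBA sketch runs the same pattern on its own, without the comma split or the leading-digits check. The helper name PatternOnly is hypothetical and the expected results are hand-traced. The last two raw results show the cases that the extra logic handles.

```vba
' Hypothetical helper: applies only the regex from cRegEx, with no pre-checks.
Function PatternOnly(ByVal s As String) As String
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.Pattern = "^((?:.*?[0-9]{4})?.*?(?:(?=\s*[0-9]{4})|(?:(?!\d{4}).)*)|(?!.*[0-9]{4}).*)"
    Set m = re.Execute(s)
    If m.Count > 0 Then PatternOnly = m(0).Value
End Function

Sub DemoPatternOnly()
    ' These two already match the captures asked for in the question.
    Debug.Print PatternOnly("String.String String 4444")    ' String.String String 4444
    Debug.Print PatternOnly("String4444 8888 String")       ' String4444

    ' The comma and the group after it are kept, hence the Split on ",".
    Debug.Print PatternOnly("String String 444 . B, 8888")  ' String String 444 . B, 8888
    Debug.Print PatternOnly("String String 444 . B")        ' String String 444 . B

    ' A leading group is returned with the rest of the line, hence the ^\d{4} check.
    Debug.Print PatternOnly("8888 String")                  ' 8888 String
End Sub
```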
