Regex lookahead to match everything prior to 1st OR 2nd group of digits

Regex in VBA.

I am using the following regex to match the second occurrence of a 4-digit group, or the first group if there is only one group:

\b\d{4}\b(?!.+\b\d{4}\b)

Now I need to do kind of the opposite: I need to match everything up until the second occurrence of a 4-digit group, or up until the first group if there is only one. If there are no 4-digit groups, capture the entire string.

This would be sufficient.

But there is also a preferable "bonus" route, if a way exists: match everything up until a 4-digit group that is optionally followed by some random text, but only if there is no other 4-digit group following it. If there exists a second group of 4 digits, capture everything up until that group (including the first group and periods, but not commas). If there are no groups, capture everything. If the line starts with a 4-digit group, capture nothing.

I understand that this could (should?) also be done with a lookahead, but I am not having any luck figuring out how lookaheads work for this purpose.

Examples:

Input: String.String String 4444  
Capture: String.String String 4444

Input: String4444 8888 String  
Capture: String4444

Input: String String 444 . B, 8888
Capture: String String 444 . B

Bonus case:

Input: 8888 String  
Capture:   

To match up until the second occurrence of a 4-digit group, or up until the first group if there is only one, use this pattern:

^((?:.*?\d{4})?.*?)(?=\s*\b\d{4}\b)

Demo

Per the comment below, use this pattern instead; the added `|.*` alternative falls back to the whole string when there is no 4-digit group:

^((?:.*?\d{4})?.*?(?=\s*\b\d{4}\b)|.*)

Demo
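As a quick sanity check, the second pattern can be run against the question's sample inputs. The sketch below uses Python's `re` module as a stand-in engine (an assumption: VBScript.RegExp is also a backtracking engine, but results in VBA should be verified separately), and `prefix` is only an illustrative helper name.

```python
import re

# Second pattern from the answer above, unchanged.
PATTERN = re.compile(r"^((?:.*?\d{4})?.*?(?=\s*\b\d{4}\b)|.*)")


def prefix(text: str) -> str:
    """Return capture group 1 for text (hypothetical helper, not from the thread)."""
    match = PATTERN.match(text)
    return match.group(1) if match else ""


samples = [
    "String4444 8888 String",       # -> 'String4444'
    "String.String String 4444",    # -> 'String.String String'
    "String String 444 . B, 8888",  # -> 'String String 444 . B,'
    "8888 String",                  # -> ''
    "no digits here",               # -> 'no digits here'
]

for sample in samples:
    # repr() makes empty results and trailing spaces visible.
    print(repr(sample), "->", repr(prefix(sample)))
```

In this engine the lone-group sample stops before 4444 and the comma sample keeps its trailing comma, so the pattern covers the basic case rather than the exact captures listed in the question.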

This matches everything except spaces up to the last occurrence of a 4-digit word.

You can use the following:

(?:(?! ).)+(?=.*\b\d{4}\b)

See DEMO
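Because this pattern excludes spaces, a global search returns one match per space-separated token rather than a single prefix. A small sketch, again using Python's `re` as a stand-in for the VBA engine (an assumption) and an illustrative function name `tokens_before_last_group`:

```python
import re

# Pattern from the answer above, unchanged.
TOKEN_PATTERN = re.compile(r"(?:(?! ).)+(?=.*\b\d{4}\b)")


def tokens_before_last_group(text: str) -> list:
    """Return every non-space run that still has a 4-digit word somewhere after it."""
    return TOKEN_PATTERN.findall(text)


print(tokens_before_last_group("String String 444 . B, 8888"))
# -> ['String', 'String', '444', '.', 'B,']
print(tokens_before_last_group("String4444 8888 String"))
# -> ['String4444']
```

To rebuild a single string from these tokens, the caller would have to join them again, which loses the original spacing if it was not a single space.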

For your basic case (marked by you as sufficient), this will work:

((?:(?!\d{4}).)*(?:\d{4})?(?:(?!\d{4}).)*)(?=\d{4})

You can pad every \d{4} internally with \b (that is, \b\d{4}\b) if you need to match whole 4-digit words only.

See a demo here.
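For reference, this is what the `\b`-padded variant looks like when written out. The sketch uses Python's `re` as a stand-in engine (an assumption; behaviour in VBScript.RegExp should be checked separately), and `padded_prefix` is a made-up name.

```python
import re

# The pattern above with every \d{4} padded with \b, as suggested.
PADDED = re.compile(
    r"((?:(?!\b\d{4}\b).)*(?:\b\d{4}\b)?(?:(?!\b\d{4}\b).)*)(?=\b\d{4}\b)"
)


def padded_prefix(text: str):
    """Return group 1 of the first match, or None when the pattern does not match."""
    match = PADDED.search(text)
    return match.group(1) if match else None


print(repr(padded_prefix("String4444 8888 String")))     # -> 'String4444 '
print(repr(padded_prefix("String.String String 4444")))  # -> 'String.String String '
print(repr(padded_prefix("no digits here")))             # -> None
```

Two things stand out: the capture keeps the space in front of the closing group, so it may need trimming, and there is no match at all when the input has no 4-digit group, because the final lookahead is mandatory.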

You can use this regex in VBA to capture lines with 4-digit numbers, or those that do not have 4-digit numbers in them:

^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)

See the demo; it should work the same in VBA.

The regex consists of 2 alternatives: (?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) and (?!.*[0-9]{4}).*

(?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4}) optionally matches as few characters as possible up to and including a 4-digit number, then matches 0 or more further characters (again as few as possible), and requires that optional whitespace and a 4-digit number follow.

(?!.*[0-9]{4}).* matches any number of characters, but only if no 4-digit number appears anywhere in them.

Note that to match only whole numbers (not digits that are part of a longer word), you need to add \b around each [0-9]{4} pattern (i.e. \b[0-9]{4}\b).
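The effect of adding `\b` is easiest to see on an input where the digits are embedded in a longer token. The comparison below is a sketch in Python's `re` (used as a stand-in engine, which is an assumption; VBA results should be confirmed separately); `capture` and the sample string `abc12345 x` are illustrative and not from the thread.

```python
import re

# Pattern from the answer above, as posted.
PLAIN = re.compile(r"^((?:.*?[0-9]{4})?.*?(?=\s*?[0-9]{4})|(?!.*[0-9]{4}).*)")
# Same pattern with \b added around every [0-9]{4}.
BOUNDED = re.compile(
    r"^((?:.*?\b[0-9]{4}\b)?.*?(?=\s*?\b[0-9]{4}\b)|(?!.*\b[0-9]{4}\b).*)"
)


def capture(pattern, text: str) -> str:
    """Return group 1 of a match at the start of text, or '' if there is none."""
    match = pattern.match(text)
    return match.group(1) if match else ""


# Both variants agree when a standalone 4-digit group is present.
print(repr(capture(PLAIN, "String4444 8888 String")))    # -> 'String4444'
print(repr(capture(BOUNDED, "String4444 8888 String")))  # -> 'String4444'

# They differ when the only digits are part of a longer run.
print(repr(capture(PLAIN, "abc12345 x")))    # -> 'abc'
print(repr(capture(BOUNDED, "abc12345 x")))  # -> 'abc12345 x'
```

Without `\b`, any four consecutive digits count as a group, so `12345` cuts the capture short; with `\b`, the input is treated as having no 4-digit group and is captured whole.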

If anyone is interested, I cheated to fully solve my problem.

Building on this answer, which handles the vast majority of my data set, I used program logic to catch some rarely seen use cases. It seemed difficult to get a single regex to cover all the situations, so this seems like a viable alternative.

The problem is illustrated here.

The code isn't bulletproof yet, but this is the gist:

' str is passed ByVal so the comma split below does not modify the caller's string.
Function cRegEx(ByVal str As String) As String

    Dim rExp As Object, rMatch As Object, regP As String, strL() As String
    regP = "^((?:.*?[0-9]{4})?.*?(?:(?=\s*[0-9]{4})|(?:(?!\d{4}).)*)|(?!.*[0-9]{4}).*)"

    ' Two use cases weren't easily solvable with regex, due to the already complex pattern(s).
    ' If str contains a comma, keep only the part before the first comma,
    ' so this case does not have to be solved in the regex.
    If InStr(str, ",") <> 0 Then
        strL = Split(str, ",")
        str = strL(0)
    End If

    ' If str starts with a 4-digit group, return an empty string.
    If cRegExNum(str) = False Then
        Set rExp = CreateObject("vbscript.regexp")
        With rExp
            .Global = False
            .MultiLine = False
            .IgnoreCase = True
            .Pattern = regP
        End With

        Set rMatch = rExp.Execute(str)
        If rMatch.Count > 0 Then
            cRegEx = rMatch(0).Value
        Else
            cRegEx = ""
        End If
    Else
        cRegEx = ""
    End If
End Function


Function cRegExNum(ByVal str As String) As Boolean
    ' Does the string start with four consecutive digits?
    ' Returns True if it does.
    Dim rExp As Object, rMatch As Object, regP As String
    regP = "^\d{4}"

    Set rExp = CreateObject("vbscript.regexp")
    With rExp
        .Global = False
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = regP
    End With

    Set rMatch = rExp.Execute(str)
    cRegExNum = (rMatch.Count > 0)
End Function
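To check the combined logic (comma split, leading-digits guard, then the regex) against the samples from the question, here is a rough Python port. It is only a sketch: Python's `re` stands in for VBScript.RegExp (an assumption, so the VBA functions should still be tested in VBA), and the name `c_regex` is made up to mirror cRegEx.

```python
import re

# Same pattern string as regP in the VBA code.
REG_P = re.compile(
    r"^((?:.*?[0-9]{4})?.*?(?:(?=\s*[0-9]{4})|(?:(?!\d{4}).)*)|(?!.*[0-9]{4}).*)"
)
# Same check as cRegExNum: does the string start with four digits?
LEADING_GROUP = re.compile(r"^\d{4}")


def c_regex(text: str) -> str:
    """Rough port of cRegEx: split on comma, guard, then apply the pattern."""
    text = text.split(",")[0]       # keep only the part before the first comma
    if LEADING_GROUP.search(text):  # starts with a 4-digit group: capture nothing
        return ""
    match = REG_P.search(text)
    return match.group(0) if match else ""


print(repr(c_regex("String.String String 4444")))    # -> 'String.String String 4444'
print(repr(c_regex("String4444 8888 String")))       # -> 'String4444'
print(repr(c_regex("String String 444 . B, 8888")))  # -> 'String String 444 . B'
print(repr(c_regex("8888 String")))                  # -> ''
```

In this port all four sample inputs produce the captures listed in the question, including the bonus case.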
