
Can this regex to find numbered lines of text in an Excel cell be improved to avoid false matches?

I have a large spreadsheet where some cells may contain many lines of text, some numbered, some not. My goal is to extract these individual numbered 'items' into separate cells.

For example, an input cell might contain something like this (in between the "s):

"1. Party A complete.
2./3. Party B to construct as per drawing 805/12.
Use ITP 675/24.

4.Party C to be engaged."

Note that an item number starts at the beginning of a line or follows one like this using a "/". Numbers are always followed by a "." (dot). There may be some or no spaces following the dot and the text for an item may then be spread over multiple lines.

Operating on the above input cell, the desired output would be:

Cell 1: "1. Party A complete."
Cell 2: "2. Party B to construct as per drawing 805/12.
Use ITP 675/24."
Cell 3: "3. Party B to construct as per drawing 805/12.
Use ITP 675/24."
Cell 4: "4.Party C to be engaged."

I have been using the RegExp class object in VBA as follows. This allows me to pinpoint the start of items and then extract the text in between these points (or end of string):

Dim RegExObj1 As RegExp
Dim mc1 As MatchCollection

Set RegExObj1 = New RegExp

With RegExObj1
    .Global = True
    .IgnoreCase = True
    .MultiLine = True
    .Pattern = "(^|/)(\d+)\."
End With

Set mc1 = RegExObj1.Execute(CleanedCellText)

This generally works, but I get unwanted matches like "/12." and "/24.", from the ends of lines. How can I change the regex to exclude these?

Note that I capture the occurrence of "/" to determine if an item number needs to inherit the text from the next number up. In this case item 2 inherits the text from item 3. But I'm not sure if there is a better way to manage this challenge.
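
To make the false matches concrete, here is a minimal sketch that runs the same pattern over the sample text and prints each match. The Sub name ShowFalseMatches and the late-bound CreateObject call are illustrative choices, not part of the code above.

```vba
Option Explicit

' Illustrative only: lists every match of the pattern from the question.
Sub ShowFalseMatches()
    Dim re As Object, m As Object, s As String

    ' Sample cell text; Excel stores in-cell line breaks as vbLf
    s = "1. Party A complete." & vbLf & _
        "2./3. Party B to construct as per drawing 805/12." & vbLf & _
        "Use ITP 675/24." & vbLf & vbLf & _
        "4.Party C to be engaged."

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    With re
        .Global = True
        .MultiLine = True                      ' ^ also matches after each line break
        .Pattern = "(^|/)(\d+)\."
    End With

    For Each m In re.Execute(s)
        ' FirstIndex is the 0-based position of the match in s
        Debug.Print m.FirstIndex & ": [" & m.Value & "]"
    Next m
End Sub
```

For this input the Immediate window should list six matches: "1.", "2.", "/3.", "/12.", "/24." and "4.". The two in the middle of that list that come from "805/12." and "675/24." are the unwanted ones.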

Given your data, a pattern like (?:\d+\.\/)|(?:\d+\.[\s\S]+?(?=(?:\x0A+\d+\.)|$)) will collect both the start of each numbered segment and the rest of that segment.

If a line number is followed by ./ , the pattern collects only that prefix, so you can tell whether an entry needs to be filled in by testing whether its rightmost character is a / . After we populate the results array, we loop through it from bottom to top and decide where we need to fill in the blanks.

So here is another approach, using regex. As written, the formula returns a vertical array. If you have O365 with dynamic arrays, it will Spill the results. If you don't, you can retrieve them either by entering the formula as an array formula over multiple cells, or by using the Index function.

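Option Explicit
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
' (VBA editor: Tools > References) for the early-bound RegExp types.
Function foo(s) As String()
    Dim RE As RegExp, MC As MatchCollection, M As Match
    ' Either a bare "N./" prefix, or "N." plus everything up to the next
    ' line that starts with a number and a dot (or the end of the string)
    Const sPat As String = "(?:\d+\.\/)|(?:\d+\.[\s\S]+?(?=(?:\x0A+\d+\.)|$))"
    Dim sTemp() As String, I As Long

    Set RE = New RegExp
    With RE
        .Global = True
        .MultiLine = False
        .Pattern = sPat
        If Not .Test(s) Then Exit Function 'no numbered items, nothing to return
        Set MC = .Execute(s)
    End With

    ReDim sTemp(1 To MC.Count, 1 To 1) '2D array for vertical results
    I = 0
    For Each M In MC
        I = I + 1
        sTemp(I, 1) = M.Value
    Next M

    ' Work from bottom to top: an entry ending in "/" has no text of its own,
    ' so it takes the text of the entry below it
    For I = UBound(sTemp, 1) - 1 To LBound(sTemp, 1) Step -1
        If Right(sTemp(I, 1), 1) = "/" Then
            sTemp(I, 1) = Replace(sTemp(I, 1), "/", "") & _
                          Mid(sTemp(I + 1, 1), InStr(sTemp(I + 1, 1), ".") + 1)
        End If
    Next I

    foo = sTemp

End Function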
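
As a quick check, the function can also be called from VBA rather than from a worksheet. The sketch below assumes foo lives in the same module and that the regular expressions reference is set; the Sub name DemoFoo is hypothetical.

```vba
' Illustrative driver for foo; prints one entry per numbered item.
Sub DemoFoo()
    Dim s As String, result() As String, i As Long

    ' Same sample text as in the question, with vbLf as the in-cell line break
    s = "1. Party A complete." & vbLf & _
        "2./3. Party B to construct as per drawing 805/12." & vbLf & _
        "Use ITP 675/24." & vbLf & vbLf & _
        "4.Party C to be engaged."

    result = foo(s)                      ' 2D array: (1 To n, 1 To 1)

    For i = LBound(result, 1) To UBound(result, 1)
        Debug.Print "Cell " & i & ": " & result(i, 1)
    Next i
End Sub
```

With the sample text this should print four entries, with item 2 carrying the same text as item 3. On a worksheet, the equivalent would be =foo(A1) where dynamic arrays are available, or something like =INDEX(foo(A1), 2) to pull out a single entry (A1 being whichever cell holds the text).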


Regex Explanation

Extract Lines

(?:\d+\.\/)|(?:\d+\.[\s\S]+?(?=(?:\x0A+\d+\.)|$))

Options: ^$ don't match at line breaks

Created with RegexBuddy
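
As a possible alternative to the single pattern above, the rule stated in the question (a number either starts a line or follows another number via "/") can be encoded directly by matching the whole header, such as "1." or "2./3.", only at the start of a line. The sketch below is not from the original answer: SplitNumberedItems is a hypothetical helper, and it assumes that text before the first header can be dropped and that no continuation line begins with a number and a dot.

```vba
Option Explicit

' Hypothetical helper: returns one string per item number, so a "2./3." header
' yields two entries that share the same text.
Function SplitNumberedItems(ByVal s As String) As Collection
    Dim reHead As Object, reTail As Object, mc As Object
    Dim items As New Collection
    Dim i As Long, bodyStart As Long, bodyEnd As Long
    Dim body As String, num As Variant

    Set reHead = CreateObject("VBScript.RegExp")   ' late binding
    reHead.Global = True
    reHead.MultiLine = True                        ' ^ matches after each line break
    reHead.Pattern = "^\d+\.(?:/\d+\.)*"           ' "1." or "2./3." at line start only

    Set reTail = CreateObject("VBScript.RegExp")
    reTail.Pattern = "\s+$"                        ' trailing spaces and line breaks

    Set mc = reHead.Execute(s)
    For i = 0 To mc.Count - 1
        ' Item text runs from just after this header to just before the next one.
        ' FirstIndex is 0-based, Mid$ is 1-based.
        bodyStart = mc.Item(i).FirstIndex + mc.Item(i).Length + 1
        If i < mc.Count - 1 Then
            bodyEnd = mc.Item(i + 1).FirstIndex
        Else
            bodyEnd = Len(s)
        End If
        body = reTail.Replace(Mid$(s, bodyStart, bodyEnd - bodyStart + 1), "")

        ' Every number in the header gets its own copy of the text
        For Each num In Split(mc.Item(i).Value, "/")
            items.Add num & body
        Next num
    Next i

    Set SplitNumberedItems = items
End Function
```

Because the header must begin at a line start, "/12." and "/24." inside "805/12." and "675/24." are never candidates. Leading spaces after the dot are kept as typed, so "4.Party C to be engaged." comes back without a space, as in the desired output.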
