Splitting blocks with a regular expression in VBA

Question

I have data that I need to split into blocks so that each block is stored in a separate row. The entire text looks like this:

م
مطروح
الحمام
school
الصف
:
الصف الأول
 1
 458316219 
 30709101600371 
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
 2
 458361688 
 30702263300318 
احمد ابوالريش فرج عبدالله
منافذ فورى
 3
 458312720 
 30703143300418 
اسلام فتحى محمد ناجى
منافذ فورى
 4
 458790904 
 30606101802299 
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
 5
 458312908 
 30612013300259 
ايمن راضى صالح سلومه
منافذ فورى
 6
 458884564 
 30802203300186 
بسمه محمد ابراهيم ظدم
منافذ فورى
 7
 477625786 
 30708263300235 
بشار نصر الله مصوف السايب
منافذ فورى

I used https://regex101.com/ and could define the start of each block like this:

\d{1,3}\n

This highlights the start of each block.

How can I split the text into separate blocks so that each block ends up in one row?

Here's the HTML for the whole page: https://pastebin.com/nu0dLvch

Here's a link to the full data: https://pastebin.com/dWcu97Wt

I would highlight the needed parts (these are the groups to match). Starting with...

ending with...

There are 22 blocks of data (groups) in total.

Looking at the regex provided by @Wiktor Stribiżew in the comments: https://regex101.com/r/dmCNuH/1

Match 11 is the first block of data that is actually needed (match group), though it truncates the final line.

After getting this amazing pattern from Wiktor, I tried to collect all the matches:

Sub Test()
    Dim a(), s As String, i As Long, j As Long
        Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text

    End With
        a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
        For i = LBound(a) To UBound(a)
            Debug.Print a(i)
        Next i
End Sub

Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

But this doesn't work for me and throws an error.

Answer 1

You may use

^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}

See the regex demo. Use it with the .Global = True and .MultiLine = True options; you do not need to set .IgnoreCase to True.

NOTE: Depending on where the text comes from, line breaks may be \r\n or a lone \r (carriage return) rather than \n. If the pattern finds nothing, you may need to replace the \n chars in the pattern with \r, or normalize the line breaks in the input first.
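If you would rather leave the pattern untouched, one option is to normalize the input before matching. The following is a minimal sketch, not part of the original answer; NormalizeLineBreaks is a hypothetical helper name.

```vba
' Hypothetical helper (an assumption, not from the original answer):
' convert every line-break style to a single vbLf so that "\n" in the
' pattern always applies.
Function NormalizeLineBreaks(ByVal s As String) As String
    s = Replace(s, vbCrLf, vbLf)   ' CR+LF -> LF (must run before the next line)
    s = Replace(s, vbCr, vbLf)     ' any remaining lone CR -> LF
    NormalizeLineBreaks = s
End Function
```

It would be called once on the input string, just before .Execute, for example sInput = NormalizeLineBreaks(sInput).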

The regex matches a line that may or may not be indented and contains 1, 2 or 3 digits, and then grabs the next four lines, none of which may itself match that initial pattern.

More details

^ - start of a line
\s* - 0 or more whitespace characters
\d{1,3} - one to three digits
(?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ( {4} ) occurrences of
- \n - a newline character that is...
- (?!\s*\d{1,3}\n) - (negative lookahead) not immediately followed by:
  - \s* - 0 or more whitespaces
  - \d{1,3} - one, two or three digits
  - \n - a newline char
- .* - any 0 or more characters other than line break characters, as many as possible.
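Because this pattern contains only non-capturing groups, each match has an empty SubMatches collection, and the whole block is available through the match's Value property. Reading SubMatches.Item(0), as the code in the question does, is therefore likely what raises the error. A minimal sketch of a helper that returns the blocks, assuming the input uses plain \n line breaks (GetBlocks is a hypothetical name):

```vba
' Sketch: collect whole-block matches into a zero-based array.
' GetBlocks is a hypothetical name; the pattern is the one explained above.
Function GetBlocks(ByVal inputString As String) As Variant
    Dim result() As String, matches As Object, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
        Set matches = .Execute(inputString)
    End With
    If matches.Count = 0 Then
        GetBlocks = Array()          ' empty array, so a caller's loop simply does nothing
        Exit Function
    End If
    ReDim result(0 To matches.Count - 1)
    For i = 0 To matches.Count - 1
        ' No capturing groups in the pattern, so read Value, not SubMatches
        result(i) = matches.Item(i).Value
    Next i
    GetBlocks = result
End Function
```

The caller can declare the receiving variable as a plain Variant (Dim a As Variant: a = GetBlocks(s)) and loop from LBound(a) to UBound(a).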

To extract the individual fields with capturing groups, you may use

^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)

See this regex demo.

^ - start of a line (with .MultiLine = True)
[^\S\n]* - 0 or more whitespace characters other than a newline char
(\d{1,3}) - Group 1: one to three digits
\n - a newline
\s* - any 0 or more whitespaces
(\d{6,}) - Group 2: six or more digits
[^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, a newline and then any 0 or more whitespaces
(\d{14}) - Group 3: fourteen digits
[^\S\n]*\n - 0 or more whitespace characters other than a newline char, and then a newline char
(.+) - Group 4: any one or more characters other than line break chars, as many as possible
\n - a newline
(.+) - Group 5: any one or more characters other than line break chars, as many as possible
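With capturing groups in place, each field can be read from the match's SubMatches collection instead of splitting the matched text. The sketch below shows one way this might look, assuming the input uses plain \n line breaks; the procedure name WriteBlocks and the target cell A1 are assumptions for illustration.

```vba
' Sketch: one row per block, one column per capturing group.
' WriteBlocks and the target range A1 are assumptions for illustration.
Sub WriteBlocks(ByVal sInput As String)
    Dim matches As Object, out() As Variant, r As Long, c As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)"
        Set matches = .Execute(sInput)
    End With
    If matches.Count = 0 Then Exit Sub   ' nothing to write
    ReDim out(1 To matches.Count, 1 To 5)
    For r = 0 To matches.Count - 1
        For c = 0 To 4
            ' SubMatches is zero-based: index 0 is Group 1, index 4 is Group 5
            out(r + 1, c + 1) = Trim$(matches.Item(r).SubMatches(c))
        Next c
    Next r
    ActiveSheet.Range("A1").Resize(matches.Count, 5).Value = out
End Sub
```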

Answer 2

Thanks a lot to Wiktor and QHarr for helping me with this issue; I really appreciate their help. Here is the final code. I welcome any other ideas or modifications to it.

Sub Test()
    Dim x As Variant, a() As Variant, bot As New ChromeDriver, col As Object
    Dim sInput As String, sPattern As String, i As Long, j As Long
    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
    ' Read the table text from the local HTML page in a headless Chrome session
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With
    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True
        .Pattern = sPattern
        Set col = .Execute(sInput)
    End With
    If col.Count = 0 Then Exit Sub   ' nothing matched, so there is nothing to write
    ' One row per matched block, one column per line of the block (5 lines each)
    ReDim a(1 To col.Count, 1 To 5)
    For i = 0 To col.Count - 1
        x = Split(col.Item(i).Value, vbLf)
        For j = LBound(x) To UBound(x)
            a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
        Next j
    Next i
    ActiveSheet.Range("A1").Resize(col.Count, UBound(a, 2)).Value = a
End Sub

Answer 1: score 2, accepted, 2020-06-15 10:00:43
Answer 2: score 1, 2020-06-15 07:39:51