Split blocks using regex in VBA

I have data that I need to split into blocks so that each block is stored in a separate row. The entire text looks like this:

م
مطروح
الحمام
school
الصف
:
الصف الأول
 1
 458316219 
 30709101600371 
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
 2
 458361688 
 30702263300318 
احمد ابوالريش فرج عبدالله
منافذ فورى
 3
 458312720 
 30703143300418 
اسلام فتحى محمد ناجى
منافذ فورى
 4
 458790904 
 30606101802299 
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
 5
 458312908 
 30612013300259 
ايمن راضى صالح سلومه
منافذ فورى
 6
 458884564 
 30802203300186 
بسمه محمد ابراهيم ظدم
منافذ فورى
 7
 477625786 
 30708263300235 
بشار نصر الله مصوف السايب
منافذ فورى

I used https://regex101.com/ and could define the start of each block like this:

\d{1,3}\n

This highlights the start of each block.

How can I split the text into separate blocks so that each block ends up in one row?

Here's the HTML for the whole page: https://pastebin.com/nu0dLvch

Here's a link to the full data: https://pastebin.com/dWcu97Wt

I would highlight the needed parts (these are the groups to match). Starting with...

ending with...

There are 22 blocks of data (groups) in total.

Looking at the regex provided by @Wiktor Stribiżew in the comments: https://regex101.com/r/dmCNuH/1

Match 11 is the first block of data that is actually needed (match group), though it truncates the final line.

After getting this amazing pattern from Wiktor, I tried to collect all the matches:

Sub Test()
    Dim a(), s As String, i As Long, j As Long
    Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text
    End With
    a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
    For i = LBound(a) To UBound(a)
        Debug.Print a(i)
    Next i
End Sub

Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

But this doesn't work for me and throws an error.

You may use

^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}

See the regex demo. Use it with the .Global = True and .MultiLine = True options; you do not need to set .IgnoreCase to True.

NOTE: The pattern assumes \n line breaks. If the line breaks in your input are carriage returns (\r) instead, you may need to replace all \n chars in the pattern with \r (or with \r?\n for Windows-style \r\n line breaks).
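An alternative to editing the pattern is to normalize the line breaks in the input before matching. The following is a minimal sketch, assuming the text may contain \r\n, lone \r, or \n line breaks; NormalizeLineBreaks is a hypothetical helper name, not something from the original answer.

```vba
' Hypothetical helper: convert CRLF and lone CR line breaks to LF,
' so that patterns written with \n can be used unchanged.
Function NormalizeLineBreaks(ByVal s As String) As String
    s = Replace(s, vbCrLf, vbLf)   ' Windows-style \r\n -> \n
    s = Replace(s, vbCr, vbLf)     ' any remaining lone \r -> \n
    NormalizeLineBreaks = s
End Function
```

It would be called once on the input string, before the string is passed to .Test or .Execute.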

The regex matches a line that may or may not be indented and consists of 1, 2 or 3 digits, and then grabs the next four lines that do not match the initial pattern.

More details

  • ^ - start of a line
  • \s* - 0 or more whitespace characters
  • \d{1,3} - one to three digits
  • (?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ( {4} ) occurrences of
    • \n - a newline character that is...
    • (?!\s*\d{1,3}\n) - ( negative lookahead ) not immediately followed with:
      • \s* - 0 or more whitespaces
      • \d{1,3} - one, two or three digits
      • \n - a newline char
    • .* - any 0 or more characters other than line break characters, as many as possible.
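The pattern contains no capturing groups, so each block is available only as the whole match (the Value property of a match). SubMatches would be empty here, which is a likely cause of the error from the GetMatches function in the question. Below is a minimal sketch of collecting the blocks that way. GetBlocks and DemoGetBlocks are hypothetical names, and the sample string uses placeholder values in the same five-line shape as the real data.

```vba
' Returns every whole match of sPattern in sInput as a zero-based array.
' Match.Value holds the full matched text; SubMatches holds only
' capture groups, and the block pattern has none.
Function GetBlocks(ByVal sInput As String, ByVal sPattern As String) As Variant
    Dim arr() As String, matches As Object, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        Set matches = .Execute(sInput)
    End With
    If matches.Count = 0 Then
        GetBlocks = Array()          ' empty array, so UBound is -1
        Exit Function
    End If
    ReDim arr(0 To matches.Count - 1)
    For i = 0 To matches.Count - 1
        arr(i) = matches.Item(i).Value
    Next i
    GetBlocks = arr
End Function

Sub DemoGetBlocks()
    Dim s As String, blocks As Variant, i As Long
    ' Two placeholder blocks: index line, two numeric lines, two text lines
    s = " 1" & vbLf & " 111111111 " & vbLf & " 11111111111111 " & vbLf & _
        "Name A" & vbLf & "Place A" & vbLf & _
        " 2" & vbLf & " 222222222 " & vbLf & " 22222222222222 " & vbLf & _
        "Name B" & vbLf & "Place B"
    blocks = GetBlocks(s, "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}")
    ' Print each block on one line, with the line breaks shown as " | "
    For i = LBound(blocks) To UBound(blocks)
        Debug.Print "Block " & i + 1 & ": " & Replace(blocks(i), vbLf, " | ")
    Next i
End Sub
```

Running DemoGetBlocks should print one line per block in the Immediate window. When nothing matches, the loop simply does not run.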

To extract detailed information with groups, you may use

^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)

See this regex demo.

  • ^ - start of a line
  • [^\S\n]* - 0 or more whitespace characters other than a newline char
  • (\d{1,3}) - Group 1: one to three digits
  • \n - a newline
  • \s* - any 0 or more whitespaces
  • (\d{6,}) - Group 2: six or more digits
  • [^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, a newline and then any 0 or more whitespaces
  • (\d{14}) - Group 3: fourteen digits
  • [^\S\n]*\n - 0 or more whitespace characters other than a newline char, and then a newline char
  • (.+) - Group 4: any one or more characters other than line break chars, as many as possible
  • \n - a newline
  • (.+) - Group 5: any one or more characters other than line break chars, as many as possible
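With capturing groups in the pattern, each field can be read from the SubMatches collection of a match. Here is a minimal sketch, assuming \n line breaks; DemoGroups is a hypothetical name and the sample string uses placeholder values.

```vba
Sub DemoGroups()
    Dim s As String, matches As Object, m As Object, g As Long
    ' One placeholder block in the same five-line shape as the data
    s = " 1" & vbLf & " 111111111 " & vbLf & " 11111111111111 " & vbLf & _
        "Name A" & vbLf & "Place A"
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)"
        Set matches = .Execute(s)
    End With
    For Each m In matches
        ' SubMatches is zero-based: index 0 is Group 1, index 4 is Group 5
        For g = 0 To m.SubMatches.Count - 1
            Debug.Print "Group " & g + 1 & ": " & m.SubMatches(g)
        Next g
    Next m
End Sub
```

Groups 1 to 3 capture only digits, so those values need no trimming. Groups 4 and 5 capture the rest of their lines and may still carry trailing whitespace.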

Thanks a lot to Wiktor and QHarr for all their help with this issue; I really appreciate it. Here is the final code. I welcome any other ideas or modifications to it.

Sub Test()
    Dim x, a(), bot As New ChromeDriver, col As Object
    Dim sInput As String, sPattern As String, i As Long, j As Long
    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
    ' Load the page headlessly and read the text of the table
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With
    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True
        .Pattern = sPattern
        Set col = .Execute(sInput)
    End With
    If col.Count = 0 Then Exit Sub   ' no blocks found, nothing to write
    ' One row per block, one column per line of the block (5 lines each)
    ReDim a(1 To col.Count, 1 To 5)
    For i = 0 To col.Count - 1
        x = Split(col.Item(i).Value, vbLf)
        For j = LBound(x) To UBound(x)
            a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
        Next j
    Next i
    ActiveSheet.Range("A1").Resize(UBound(a, 1), UBound(a, 2)).Value = a
End Sub
