
Split blocks using regex in VBA

I have data and need to split it into blocks so that each block is stored in a separate row. The entire text looks like this:

الصف الأول
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
احمد ابوالريش فرج عبدالله
منافذ فورى
اسلام فتحى محمد ناجى
منافذ فورى
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
ايمن راضى صالح سلومه
منافذ فورى
بسمه محمد ابراهيم ظدم
منافذ فورى
بشار نصر الله مصوف السايب
منافذ فورى

I used https://regex101.com/ and could define the start of each block like this:


This highlights the start of each block.

How can I split and separate the blocks so that each block ends up in one row?

Here's the HTML for the whole page: https://pastebin.com/nu0dLvch

Here's a link to the full data: https://pastebin.com/dWcu97Wt

I would highlight the needed parts (these are the groups to match). Starting with...

ending with...

There are 22 blocks of data (groups) in total.

Looking at the regex provided by @Wiktor Stribiżew in the comments: https://regex101.com/r/dmCNuH/1

Match 11 is the first block of actually needed data (match group), though it truncates the final line.


After getting the amazing pattern from Wiktor, I tried to get all the matches:

Sub Test()
    Dim a(), s As String, i As Long, j As Long
    Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text
    End With
    a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
    For i = LBound(a) To UBound(a)
        Debug.Print a(i)
    Next i
End Sub

Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

But this doesn't work for me and throws an error.
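The question does not quote the error, so what follows is an assumption about its cause. The pattern has no capturing groups, so `iMatch.SubMatches.Item(0)` has nothing to return and most likely fails at run time. In addition, there is no `Else` before `ReDim arrMatches(0)`, so the array would be reset even when matches are found. A minimal sketch of the function with both points addressed, keeping the names from the question:

```vba
Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches() As Variant, matches As Object, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        Set matches = .Execute(inputString)
    End With
    If matches.Count > 0 Then
        ReDim arrMatches(0 To matches.Count - 1)
        For i = 0 To matches.Count - 1
            ' Take the whole match; SubMatches is empty without capturing groups
            arrMatches(i) = matches.Item(i).Value
        Next i
    Else
        ' Fallback used only when nothing matched
        ReDim arrMatches(0 To 0)
        arrMatches(0) = vbNullString
    End If
    GetMatches = arrMatches
End Function
```

The calling `Test` procedure can stay as it is, since it only loops from `LBound(a)` to `UBound(a)`.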

You may use

    ^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}

See the regex demo. Use it with the .Global = True and .MultiLine = True options; you do not need to set .IgnoreCase to True.

NOTE: Since \r, carriage return, is used inside Excel cell values to define a line break, you may need to replace all \n chars in the pattern with \r.
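An alternative to editing the pattern, offered here as an assumption rather than as part of the original answer, is to normalize every line break in the input to \n before matching. `NormalizeLineBreaks` is a hypothetical helper name:

```vba
Function NormalizeLineBreaks(ByVal s As String) As String
    ' Convert CRLF first so it does not turn into two line feeds
    s = Replace(s, vbCrLf, vbLf)
    ' Then convert any remaining lone carriage returns
    NormalizeLineBreaks = Replace(s, vbCr, vbLf)
End Function
```

The pattern can then be used unchanged on the result of `NormalizeLineBreaks(inputText)`, whichever line-break style the source used.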

The regex matches a line that may be indented or not and contains 1, 2 or 3 digits, and then grabs the next four lines that do not match the initial pattern.

More details

  • ^ - start of a line
  • \s* - 0 or more whitespace characters
  • \d{1,3} - one to three digits
  • (?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ( {4} ) occurrences of
    • \n - a newline character that is...
    • (?!\s*\d{1,3}\n) - (negative lookahead) not immediately followed by:
      • \s* - 0 or more whitespaces
      • \d{1,3} - one, two or three digits
      • \n - a newline char
    • .* - any 0 or more characters other than line break characters, as many as possible.
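The breakdown above can be tried out on a small string. The sketch below uses an invented two-block sample (not the real data from the question) and prints each matched block on a single line; `DemoBlockMatch` is a hypothetical name:

```vba
Sub DemoBlockMatch()
    Dim s As String, m As Object
    ' Invented sample: two blocks of five lines each
    s = "1" & vbLf & "123456" & vbLf & "12345678901234" & vbLf & "Name A" & vbLf & "Place A" & vbLf & _
        "2" & vbLf & "654321" & vbLf & "43210987654321" & vbLf & "Name B" & vbLf & "Place B"
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
        For Each m In .Execute(s)
            ' Put each block on one line by swapping line feeds for a separator
            Debug.Print Replace(m.Value, vbLf, " | ")
        Next m
    End With
End Sub
```

Each line written to the Immediate window corresponds to one block, with its five lines joined by " | ".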

To extract detailed information with groups, you may use

    ^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)

See this regex demo.

  • ^ - start of a line (with .MultiLine = True)
  • [^\S\n]* - 0 or more whitespace characters other than a newline char
  • (\d{1,3}) - Group 1: one to three digits
  • \n - a newline
  • \s* - any 0 or more whitespaces
  • (\d{6,}) - Group 2: six or more digits
  • [^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, then a newline, then any 0 or more whitespaces
  • (\d{14}) - Group 3: fourteen digits
  • [^\S\n]*\n - 0 or more whitespace characters other than a newline char, then a newline char
  • (.+) - Group 4: any one or more characters other than line break chars, as many as possible
  • \n - a newline
  • (.+) - Group 5: any one or more characters other than line break chars, as many as possible
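In VBA the captured groups are read through the `SubMatches` collection of each match. The sketch below assumes the pattern assembled from the pieces listed above and runs it on an invented sample block (not taken from the real data); `DemoGroups` is a hypothetical name:

```vba
Sub DemoGroups()
    Dim s As String, m As Object, k As Long
    ' Invented sample block with five lines
    s = "1" & vbLf & "123456" & vbLf & "12345678901234" & vbLf & "Name A" & vbLf & "Place A"
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)"
        For Each m In .Execute(s)
            ' SubMatches is zero-based: index 0 holds Group 1, index 4 holds Group 5
            For k = 0 To m.SubMatches.Count - 1
                Debug.Print "Group " & (k + 1) & ": " & m.SubMatches(k)
            Next k
        Next m
    End With
End Sub
```

Reading the groups this way avoids splitting the match on line feeds afterwards, at the cost of a stricter pattern that only matches blocks with exactly this layout.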

Thanks a lot to Wiktor and QHarr for helping me with this issue; I really appreciate their help. Here is the final code. I welcome any other ideas or modifications to it.

Sub Test()
    Dim x, a(), bot As New ChromeDriver, col As Object
    Dim sInput As String, sPattern As String, i As Long, j As Long
    ' A line holding 1-3 digits, followed by the next four lines of the block
    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With
    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True
        .Pattern = sPattern
        Set col = .Execute(sInput)
    End With
    If col.Count = 0 Then Exit Sub      ' nothing matched, so nothing to write
    ReDim a(1 To col.Count, 1 To 5)     ' one row per block, one column per line
    For i = 0 To col.Count - 1
        x = Split(col.Item(i).Value, vbLf)
        For j = LBound(x) To UBound(x)
            ' Trim spaces and strip non-printable characters from each line
            a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
        Next j
    Next i
    ActiveSheet.Range("A1").Resize(UBound(a, 1), UBound(a, 2)).Value = a
End Sub
