Split blocks using regex in VBA
I have data that I need to split into blocks so that each block can be stored in a separate row. The entire text looks like:
م
مطروح
الحمام
school
الصف
:
الصف الأول
1
458316219
30709101600371
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
2
458361688
30702263300318
احمد ابوالريش فرج عبدالله
منافذ فورى
3
458312720
30703143300418
اسلام فتحى محمد ناجى
منافذ فورى
4
458790904
30606101802299
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
5
458312908
30612013300259
ايمن راضى صالح سلومه
منافذ فورى
6
458884564
30802203300186
بسمه محمد ابراهيم ظدم
منافذ فورى
7
477625786
30708263300235
بشار نصر الله مصوف السايب
منافذ فورى
I used https://regex101.com/ and could identify the start of each block with:
\d{1,3}\n
This highlights the start of each block.
How can I split the text into blocks so that each block ends up in its own row?
Here's the HTML for the whole page: https://pastebin.com/nu0dLvch
Here's a link to the full data: https://pastebin.com/dWcu97Wt
I would highlight the needed parts (these are the groups to match). Starting with... ending with...
There are 22 blocks of data (groups) in total.
Looking at the regex provided by @Wiktor Stribiżew in the comments: https://regex101.com/r/dmCNuH/1
Match 11 is the first block of data that is actually needed (match group), though it truncates the final line.
After I got the amazing pattern from Wiktor, I tried to get all the matches:
Sub Test()
    Dim a(), s As String, i As Long, j As Long
    Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text
    End With
    a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
    For i = LBound(a) To UBound(a)
        Debug.Print a(i)
    Next i
End Sub
Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function
But this doesn't work for me and throws an error.
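The error message is not quoted, so the following is only a likely explanation: the pattern passed to GetMatches contains no capturing groups, so each match's SubMatches collection is empty and SubMatches.Item(0) fails. A minimal sketch of a variant that returns the whole matched text through the Value property instead is shown below; the name GetFullMatches is hypothetical and is used only to keep it apart from the function above.

```vba
' Hypothetical variant of GetMatches: returns every full match of sPattern
' in inputString as a zero-based Variant array.
' Assumption: the caller wants the whole matched text, not a capture group.
Function GetFullMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches() As Variant, matches As Object, iMatch As Object, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        Set matches = .Execute(inputString)
    End With
    If matches.Count = 0 Then
        ' Same fallback as the original: a single empty string.
        ReDim arrMatches(0 To 0)
        arrMatches(0) = vbNullString
    Else
        ReDim arrMatches(0 To matches.Count - 1)
        For Each iMatch In matches
            arrMatches(i) = iMatch.Value   ' whole match; SubMatches is empty without groups
            i = i + 1
        Next iMatch
    End If
    GetFullMatches = arrMatches
End Function
```

In Sub Test, the call would then read a = GetFullMatches(s, ...) with the same pattern string.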
You may use
^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}
See the regex demo. Use it with the .Global = True and .MultiLine = True options; you do not need to set .IgnoreCase to True.
NOTE: If the text uses \r (carriage return) rather than \n to mark a line break, as can happen with Excel cell values, you may need to replace all \n chars in the pattern with \r.
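If it is unclear which line-break character the scraped text actually contains, one alternative to editing the pattern is to normalize the input first so that the \n-based pattern applies unchanged. This is only a sketch; the helper name NormalizeLineBreaks is hypothetical.

```vba
' Hypothetical helper: converts CRLF and lone CR line breaks to LF
' so that a pattern written with \n matches regardless of the source.
Function NormalizeLineBreaks(ByVal s As String) As String
    s = Replace(s, vbCrLf, vbLf)   ' Windows-style breaks first
    s = Replace(s, vbCr, vbLf)     ' then any remaining lone carriage returns
    NormalizeLineBreaks = s
End Function
```

It would be called on the input string before .Execute, for example sInput = NormalizeLineBreaks(sInput).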
The regex matches a line, indented or not, that contains 1, 2 or 3 digits, and then grabs the next four lines, none of which may match the initial pattern.
More details
^ - start of a line
\s* - 0 or more whitespace characters
\d{1,3} - one to three digits
(?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ( {4} ) occurrences of:
  \n - a newline character that is...
  (?!\s*\d{1,3}\n) - (negative lookahead) not immediately followed by:
    \s* - 0 or more whitespaces
    \d{1,3} - one, two or three digits
    \n - a newline char
  .* - any 0 or more characters other than line break characters, as many as possible.
To extract detailed information with groups, you may use
^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)
See this regex demo.
^ - start of a line (with .MultiLine = True)
[^\S\n]* - 0 or more whitespace characters other than a newline char
(\d{1,3}) - Group 1: one to three digits
\n - a newline
\s* - any 0+ whitespaces
(\d{6,}) - Group 2: six or more digits
[^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, a newline and then any 0 or more whitespaces
(\d{14}) - Group 3: fourteen digits
[^\S\n]*\n - 0 or more whitespace characters other than a newline char, then a newline char
(.+) - Group 4: any one or more characters other than line break chars, as many as possible
\n - a newline
(.+) - Group 5: any one or more characters other than line break chars, as many as possible
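With the grouped pattern, each field of a block can be read from the match's SubMatches collection, which is zero-based (Group 1 is SubMatches(0)). The sketch below runs the pattern against a made-up five-line block; the sample text, the Sub name ListBlockFields and the use of LF line breaks are assumptions for illustration only.

```vba
Sub ListBlockFields()
    Dim sInput As String, m As Object, k As Long
    ' Placeholder block in the same five-line layout as the data; LF breaks assumed.
    sInput = Join(Array("1", "458316219", "30709101600371", "Name line", "Office line"), vbLf)
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = "^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)"
        For Each m In .Execute(sInput)
            ' SubMatches holds the five captured fields of this block.
            For k = 0 To m.SubMatches.Count - 1
                Debug.Print "Group " & (k + 1) & ": " & m.SubMatches(k)
            Next k
        Next m
    End With
End Sub
```

Writing m.SubMatches(k) into a worksheet row instead of printing it would avoid having to split each match on line breaks afterwards.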
Thanks a lot to Wiktor and QHarr for helping me so much with this issue. I really appreciate their help.
Here is the final code. I welcome any other ideas or modifications to it.
Sub Test()
    ' ChromeDriver comes from SeleniumBasic (a VBA project reference to it is assumed).
    Dim x, a(1 To 1000, 1 To 5), bot As New ChromeDriver, col As Object
    Dim sInput As String, sPattern As String, i As Long, j As Long, cnt As Long

    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"

    ' Load the page headlessly and read the text of the table.
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With

    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True
        .Pattern = sPattern
        If .Test(sInput) Then
            Set col = .Execute(sInput)
            For i = 0 To col.Count - 1
                ' Each match is one block: split it into its lines.
                x = Split(col.Item(i), vbLf)
                cnt = cnt + 1
                For j = LBound(x) To UBound(x)
                    a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
                Next j
            Next i
        End If
    End With

    ' Write one block per row; skip the write when nothing matched.
    If cnt > 0 Then ActiveSheet.Range("A1").Resize(cnt, UBound(a, 2)).Value = a
End Sub