Use RegEx to extract text between HTML tags
I have to extract some text from a string in Visual Basic, like this:
<div id="div">
<h2 id="id-date">09.09.2010</h2> , here to extract the date
<h3 id="nr">000</h3> , here a number </div>
I need to extract the date and the number, both from within the div. Also and this will be in loop, meaning there are more div block needed to be parsed. Thank you!
Adrian
Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect, then using a regex is possible.
If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless, because my answer makes a few assumptions. Some questions to consider:
Will the HTML always be of the form <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
For the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?
Depending on the answers to these questions, the pattern can change. The following code assumes that each HTML fragment follows the structure you shared, that it has an h2 and an h3 with the date and number, respectively, and that each tag is on a new line. If you feed it different input, it will likely break until you adjust the pattern to match your input's structure.
' Requires: Imports System.Text.RegularExpressions
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
    "</div>"
' The named groups Date and Number capture the two values of interest.
Dim pattern As String = "<div[^>]+>.*?" & _
    "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
    "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)
If m.Success Then
    ' DateTime.Parse interprets the text using the current culture.
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Else
    Console.WriteLine("No match!")
End If
The pattern can be on one line, but I broke it up for clarity. RegexOptions.Singleline is used to allow the . metacharacter to match \n, so the pattern can span newlines.
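To see what RegexOptions.Singleline changes, here is a minimal sketch. The module name and the shortened input are made up for illustration; only the Regex calls come from the .NET library.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineDemo
    Sub Main()
        ' Two tags separated by a line break, as in the sample input.
        Dim input As String = "<div>" & vbLf & "<h2>text</h2>"
        Dim pattern As String = "<div>.*?<h2>"

        ' Default options: "." matches any character except \n,
        ' so the pattern cannot get past the line break. Prints False.
        Console.WriteLine(Regex.IsMatch(input, pattern))

        ' Singleline: "." also matches \n. Prints True.
        Console.WriteLine(Regex.IsMatch(input, pattern, RegexOptions.Singleline))
    End Sub
End Module
```

If your tags are never separated by newlines, the option makes no difference and can be dropped.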
You also said:
Also and this will be in loop, meaning there are more div block needed to be parsed.
Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter, you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.
EDIT: here is some sample code to demonstrate parsing multiple occurrences.
' Requires: Imports System.Text.RegularExpressions
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
    "</div>" & _
    "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
    "</div>"
Dim pattern As String = "<div[^>]+>.*?" & _
    "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
    "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"
' Regex.Matches returns one Match per div block found in the string.
For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Next
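One caveat about the DateTime.Parse calls above: the result depends on the current culture of the machine running the code, so the same text may be read differently or rejected elsewhere. The sketch below pins the format explicitly instead. It assumes the dates are month.day.year, which the sample value 09.14.2010 suggests but the question does not confirm; swap the format string for "dd.MM.yyyy" if your data is day-first. The module name is made up for illustration.

```vbnet
Imports System
Imports System.Globalization

Module DateParsingSketch
    Sub Main()
        ' Stand-in for m.Groups("Date").Value from the loop above.
        Dim captured As String = "09.14.2010"
        Dim parsedDate As DateTime

        ' Assumption: the captured text is month.day.year.
        If DateTime.TryParseExact(captured, "MM.dd.yyyy", CultureInfo.InvariantCulture, _
                                  DateTimeStyles.None, parsedDate) Then
            ' Prints 2010-09-14
            Console.WriteLine(parsedDate.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
        Else
            Console.WriteLine("Unrecognized date: " & captured)
        End If
    End Sub
End Module
```

TryParseExact returns False instead of throwing, which is convenient inside a loop over many matches.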
You should not be parsing HTML with regular expressions because, as Daniel Vandersluis stated, HTML is not a regular language. You can use the HTML Agility Pack instead.
Why not just use the HTML Agility Pack?
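For comparison, here is a rough sketch of what the HTML Agility Pack route could look like for the markup in the question. It assumes the third-party HtmlAgilityPack package is referenced in the project; the module name and the XPath expressions are my own choices for illustration and are not taken from any of the answers.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        Dim html As String = "<div id=""div"">" & _
                             "<h2 id=""id-date"">09.09.2010</h2>" & _
                             "<h3 id=""nr"">000</h3>" & _
                             "</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath.
        Dim divs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")
        If divs Is Nothing Then
            Console.WriteLine("No div found")
            Return
        End If

        For Each div As HtmlNode In divs
            ' Look up the two child elements by their id attributes.
            Dim dateNode As HtmlNode = div.SelectSingleNode(".//h2[@id='id-date']")
            Dim numberNode As HtmlNode = div.SelectSingleNode(".//h3[@id='nr']")
            If dateNode IsNot Nothing AndAlso numberNode IsNot Nothing Then
                Console.WriteLine("Date: " & dateNode.InnerText.Trim())
                Console.WriteLine("Number: " & numberNode.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Because the parser builds a document tree, this keeps working when attributes are reordered or extra whitespace appears, cases where the regex pattern would need to be rewritten.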
If your HTML tag has attributes, then here is my solution:
<TAG(.*?)>(.*?)</TAG>
Example (using C#):
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World
Try this, taken from this link:
private string StripHTML(string htmlString)
{
// This pattern matches everything found inside HTML tags:
// (.|\n) -> any character or a newline
// *?     -> zero or more occurrences, non-greedy, meaning
// that the match stops at the first available '>' it sees, not at the last one
// (if it stopped at the last one, we could overlook
// nested HTML tags inside a bigger HTML tag).
// Thanks to Oisin and Hugh Brown for helping on this one...
string pattern = @"<(.|\n)*?>";
return Regex.Replace(htmlString,pattern,string.Empty);
}