
Use RegEx to extract text between HTML tags

I have to extract some text from a string in Visual Basic, like this:

<div id="div">
<h2 id="id-date">09.09.2010</h2> , here to extract the date 

<h3 id="nr">000</h3> , here a number </div>

I need to extract the date and the number, both from within the div. This will also run in a loop, meaning there are more div blocks that need to be parsed. Thank you! Adrian

Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect, then using a regex is possible.

If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless, because my answer makes a few assumptions. Some questions to consider:

  • Will the HTML be on one line or on multiple lines separated by newline characters?
  • Will the HTML always be in the form of <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be other h1-h6 tags?
  • On top of the hN tags, will the date and number always be between the tags with id-date and nr values for the id attribute?

Depending on the answers to these questions, the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and an h3 with the date and number, respectively, and that each tag will be on a new line. If you feed it different input, it will likely fail until you adjust the pattern to match that input's structure.

' Requires "Imports System.Text.RegularExpressions" at the top of the file.
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
               "</div>"

Dim pattern As String = "<div[^>]+>.*?" & _
                 "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                 "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)

If m.Success Then
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Else
    Console.WriteLine("No match!")
End If

The pattern can be on one line, but I broke it up for clarity. RegexOptions.Singleline is used to allow the . metacharacter to match \n (newline) characters.
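
To make the effect of RegexOptions.Singleline concrete, here is a minimal sketch. The module name and the sample string are made up for illustration and are not part of the original answer.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineDemo
    Sub Main()
        ' Hypothetical sample: a line feed sits between the opening tag and the text.
        Dim sample As String = "<h2>" & vbLf & "09.09.2010</h2>"

        ' Without Singleline, "." does not match the line feed, so there is no match.
        Console.WriteLine(Regex.IsMatch(sample, "<h2>.*?</h2>"))

        ' With Singleline, "." also matches the line feed, so the pattern matches.
        Console.WriteLine(Regex.IsMatch(sample, "<h2>.*?</h2>", RegexOptions.Singleline))
    End Sub
End Module
```

This should print False and then True, which is why the patterns in this answer need the option when the tags are on separate lines.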

You also said:

Also and this will be in loop, meaning there are more div block needed to be parsed.

Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter, you'll want to use Regex.Matches and treat each Match result the same way as in the above piece of code.


EDIT: Here is some sample code to demonstrate parsing multiple occurrences.

' Requires "Imports System.Text.RegularExpressions" at the top of the file.
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
               "</div>" & _
               "<div id=""div"">" & Environment.NewLine & _
               "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
               "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
               "</div>"

Dim pattern As String = "<div[^>]+>.*?" & _
                 "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
                 "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Next
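
One caveat about the code above: DateTime.Parse depends on the current culture, and a value like 09.14.2010 is only valid when read as month.day.year. If you know the format, parsing it explicitly is more predictable. The sketch below assumes the format is MM.dd.yyyy, which is a guess based on the sample data; the module and variable names are hypothetical.

```vb
Imports System
Imports System.Globalization

Module DateParsingSketch
    Sub Main()
        ' Stand-in for m.Groups("Date").Value from the loop above.
        Dim captured As String = "09.14.2010"
        Dim parsed As DateTime

        ' Assumption: dates are month.day.year. Swap in "dd.MM.yyyy" if yours are day-first.
        If DateTime.TryParseExact(captured, "MM.dd.yyyy", CultureInfo.InvariantCulture,
                                  DateTimeStyles.None, parsed) Then
            Console.WriteLine(parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
        Else
            Console.WriteLine("Unrecognized date: " & captured)
        End If
    End Sub
End Module
```

TryParseExact returns False instead of throwing, so one malformed block does not abort the whole loop.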

You should not be parsing HTML with regular expressions, because HTML is not a regular language, as stated by Daniel Vandersluis. You can use the HTML Agility Pack instead.

Why not just use the HTML Agility Pack?
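
For anyone who wants to try that route, here is a rough sketch of what it might look like. It assumes the HtmlAgilityPack NuGet package is referenced, and it reuses the markup and id values from the question. The module name and sample string are made up for illustration, so treat it as a starting point rather than a finished solution.

```vb
Imports System
Imports HtmlAgilityPack

Module AgilityPackSketch
    Sub Main()
        ' Hypothetical input: two div blocks shaped like the one in the question.
        Dim html As String =
            "<div id=""div""><h2 id=""id-date"">09.09.2010</h2><h3 id=""nr"">000</h3></div>" &
            "<div id=""div""><h2 id=""id-date"">09.14.2010</h2><h3 id=""nr"">123</h3></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when the XPath matches no nodes.
        Dim divs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div")
        If divs Is Nothing Then
            Console.WriteLine("No div blocks found.")
            Return
        End If

        For Each div As HtmlNode In divs
            ' The leading "." keeps each lookup relative to the current div.
            Dim dateNode As HtmlNode = div.SelectSingleNode(".//h2[@id='id-date']")
            Dim numberNode As HtmlNode = div.SelectSingleNode(".//h3[@id='nr']")
            If dateNode IsNot Nothing AndAlso numberNode IsNot Nothing Then
                Console.WriteLine("Date: " & dateNode.InnerText.Trim() &
                                  ", Number: " & numberNode.InnerText.Trim())
            End If
        Next
    End Sub
End Module
```

Unlike the regex, this does not care whether the tags sit on one line or several, or whether extra attributes or whitespace appear inside them.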

If your HTML tag has attributes, then here is my solution:

<TAG(.*?)>(.*?)</TAG>

Example (using C#):

var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

Try this, taken from this link:

private string StripHTML(string htmlString)
{
    // This pattern matches everything found inside HTML tags:
    // (.|\n) -> any character or a newline
    // *?     -> zero or more occurrences, matched non-greedily, meaning
    // that the match stops at the first available '>' it sees, not at the last one
    // (if it stopped at the last one, we could overlook
    // nested HTML tags inside a bigger HTML tag).
    // Thanks to Oisin and Hugh Brown for helping on this one...

    string pattern = @"<(.|\n)*?>";  

    return  Regex.Replace(htmlString,pattern,string.Empty);
}


 