What is the regex to match this pattern in an HTML document in C#?

I really can't work out how best to do this. I can write fairly simple regular expressions, but the more complex ones really stump me.

The following appears in specific HTML documents:

<span id="label">
<span>
<a href="http://variableLink">Joe Bloggs</a>
now using
</span>
<span>
'
<a href="/variableLink/">Important Data</a>
'
</span>
<span>
on
<a href="/variableLink">Important data 2</a>
</span>
</span>

I need to extract the two 'important data' values, and I could spend hours working out the regex to do it. (I'm using the .NET Regex library in C# on .NET 3.5.)

As has often been stated before, regular expressions are usually not the right tool for parsing HTML, XML, and friends; consider using an HTML or XML parsing library instead. If you really want to or have to use regular expressions, the following will match the content of the anchor tags in many cases, but it may still fail in others.

<a href="[^"]*">(?<data>[^<]*)</a>

The following expression matches only links whose href does not start with http://; this is the only obvious difference I can see between the links.

<a href="(?!http://)[^"]*">(?<data>[^<]*)</a>
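As a rough sketch of how the second expression could be used from C#, the program below runs it against the sample markup from the question and reads the named group `data`. The class name `RelativeLinkDemo` and the helper `ExtractRelativeLinkTexts` are made up for illustration, and the pattern is only as robust as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RelativeLinkDemo
{
    // Pattern from the answer: the negative lookahead rejects hrefs that start with http://.
    static readonly Regex LinkPattern =
        new Regex(@"<a href=""(?!http://)[^""]*"">(?<data>[^<]*)</a>");

    // Hypothetical helper: returns the text of every link the pattern matches.
    public static List<string> ExtractRelativeLinkTexts(string html)
    {
        var texts = new List<string>();
        foreach (Match match in LinkPattern.Matches(html))
        {
            texts.Add(match.Groups["data"].Value);
        }
        return texts;
    }

    static void Main()
    {
        // Sample markup from the question.
        string html = @"<span id=""label"">
<span>
<a href=""http://variableLink"">Joe Bloggs</a>
now using
</span>
<span>
'
<a href=""/variableLink/"">Important Data</a>
'
</span>
<span>
on
<a href=""/variableLink"">Important data 2</a>
</span>
</span>";

        foreach (string text in ExtractRelativeLinkTexts(html))
        {
            Console.WriteLine(text);
        }
        // Prints:
        // Important Data
        // Important data 2
    }
}
```

The "Joe Bloggs" link is skipped only because its href starts with http://; if an unwanted link used a relative URL, this pattern would pick it up as well.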

The code below uses HtmlAgilityPack. It prints the text of every link inside the second or later child span of the element with id "label". Of course, it's relatively simple to modify the XPath to do something a little different.

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<span id=""label"">
<span>
<a href=""http://variableLink"">Joe Bloggs</a>
now using
</span>
<span>
'
<a href=""/variableLink/"">Important Data</a>
'
</span>
<span>
on
<a href=""/variableLink"">Important data 2</a>
</span>
</span>
"));
    HtmlNode root = doc.DocumentNode;

    HtmlNodeCollection anchors;
    anchors = root.SelectNodes("//span[@id='label']/span[position()>=2]/a/text()");
    IList<string> importantStrings;
    if(anchors != null)
    {
        importantStrings = new List<string>(anchors.Count);
        foreach(HtmlNode anchor in anchors)
            importantStrings.Add(((HtmlTextNode)anchor).Text);
    }
    else
        importantStrings = new List<string>(0);

    foreach(string s in importantStrings)
        Console.WriteLine(s);

Look up the look-behind and look-ahead syntax for .NET regular expressions and use it to find the anchor tags in the HTML. This site may help you. As an alternative to regular expressions, you might consider using a System.Xml.XPath.XPathNavigator to address those nodes directly.
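A minimal sketch of the XPathNavigator route follows. It assumes the fragment is well-formed XML, which happens to be true of the sample in the question but often is not true of real-world HTML. The class name `LabelLinkDemo`, the helper `ExtractLabelLinkTexts`, and the XPath expression are illustrative choices, not part of the original suggestion.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class LabelLinkDemo
{
    // Hypothetical helper: selects the links in the second and later child
    // spans of the element with id "label" and returns their text.
    // Throws an XmlException if the input is not well-formed XML.
    public static List<string> ExtractLabelLinkTexts(string xhtml)
    {
        var texts = new List<string>();
        XPathDocument document = new XPathDocument(new StringReader(xhtml));
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator links =
            navigator.Select("//span[@id='label']/span[position()>=2]/a");
        while (links.MoveNext())
        {
            // Value is the concatenated text content of the current <a> element.
            texts.Add(links.Current.Value);
        }
        return texts;
    }

    static void Main()
    {
        // Sample markup from the question.
        string xhtml = @"<span id=""label"">
<span>
<a href=""http://variableLink"">Joe Bloggs</a>
now using
</span>
<span>
'
<a href=""/variableLink/"">Important Data</a>
'
</span>
<span>
on
<a href=""/variableLink"">Important data 2</a>
</span>
</span>";

        foreach (string text in ExtractLabelLinkTexts(xhtml))
        {
            Console.WriteLine(text);
        }
        // Prints:
        // Important Data
        // Important data 2
    }
}
```

For markup that is not well-formed, an HTML-aware parser is likely the safer choice, since XPathDocument rejects such input outright.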

My regex is a little rusty, but something along the following lines may help (though it may need some fine-tuning):

(?<=\<a href="/variableLink[/]?"\>)(.*?)(?=</a>)

  <a\shref.*?"/variableLink/?">(.*)</a>

The first group contains the text of each anchor. Tested with Expresso; it works on the sample text you've provided.
Update: it works with Snippy too.

Regex regex = new Regex(@"<a\shref.*?""/variableLink/?"">(.*)</a>", RegexOptions.Multiline);
foreach (Match everyMatch in regex.Matches(sText))
{
  Console.WriteLine("{0}", everyMatch.Groups[1]);
}

Outputs:

Important Data
Important data 2
