Extracting the value of an HTML element with C#

In WordPress-generated pages, there is the following meta tag:

<meta name="generator" content="WordPress 3.4.2" />

I'm looking for an easy way to extract "3.4.2" (in the above example).

Would using XmlDocument or a regular expression be faster?

I found JSoup, but that's overkill for what I'm trying to do.

EDIT

Just to clarify: I don't want to include any external libraries.
Also, this is running in a class library, so using PowerShell isn't an option either.

As you're not trying to match paired tags or anything, a regular expression should be fine. Just search for content="WordPress (\\d\\.\\d\\.\\d) or similar. (If it's really consistent, you could search for the whole meta tag.)
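A minimal sketch of that suggestion, using only System.Text.RegularExpressions from the base class library. The class and method names are hypothetical. The pattern is slightly looser than the one above: it assumes the attribute value is double-quoted and allows multi-digit version parts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GeneratorTag
{
    // Hypothetical helper: returns the version from the generator meta tag,
    // or null when no WordPress content attribute is found.
    public static string ExtractWordPressVersion(string html)
    {
        // \d+(?:\.\d+)* accepts one or more dot-separated numeric parts,
        // so "3.4.2", "3.10" and "4" all match.
        Match match = Regex.Match(
            html,
            @"content=""WordPress\s+(\d+(?:\.\d+)*)""",
            RegexOptions.IgnoreCase);

        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling GeneratorTag.ExtractWordPressVersion on the meta tag from the question returns "3.4.2". A page without the tag yields null, so the caller should check for that.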

Trying to parse an HTML page as an XmlDocument might not work out; not all valid (or browser-supported) HTML is valid XML.
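To illustrate the point, here is a small sketch (the helper name is hypothetical) that reports whether a piece of markup loads as XML. HTML allows a meta tag without a closing slash, but XmlDocument.LoadXml throws an XmlException on it.

```csharp
using System;
using System.Xml;

public static class XmlParseCheck
{
    // Hypothetical helper: true if the markup loads as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags end up here.
            return false;
        }
    }
}
```

The self-closed tag from the question passes this check on its own, while a head element containing an unclosed meta tag (legal HTML) does not. That is why parsing a whole real-world page this way is fragile.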

Make use of HTML Agility Pack to parse the HTML.

EDIT (code to copy)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace HTMLAgilityExample
{
    class Program
    {
        static void Main(string[] args)
        {
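            string contentValue = null;

            HtmlDocument document = new HtmlDocument();
            document.Load("C:/test.html");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection metaNodes = document.DocumentNode.SelectNodes("//meta[@content]");
            if (metaNodes != null)
            {
                foreach (HtmlNode meta in metaNodes)
                {
                    HtmlAttribute attribute = meta.Attributes["content"];
                    if (attribute.Value.Contains("WordPress"))
                    {
                        // "WordPress 3.4.2" -> "3.4.2"
                        contentValue = attribute.Value.Replace("WordPress", "").Trim();
                        break;
                    }
                }
            }

            Console.WriteLine(contentValue ?? "WordPress generator tag not found");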
        }
    }
}

Since you have to parse the version out of the attribute value anyway, and since it sounds like you're not looking to do any extensive HTML parsing beyond this task, I'd suggest a regular expression.

This should give you a start. The expression can be simplified a bit; maybe it is unnecessary to specify that the attribute value is within a meta tag. Or it can be tightened up a bit; maybe it would be better to specify the "content" attribute. Either way, this worked in my quick testing.

Note that for better readability, I like to leave whitespace within the regular expression and include the IgnorePatternWhitespace option.

using System.Text.RegularExpressions;

var html = ""; // Populate the html string here

var options = RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace;
var regx = new Regex( "<meta\\s+? .*? WordPress\\s*? (?<version> [\\d\\.]+) [^\\d\\.] .*? />", options );

var match = regx.Match( html );

if ( match.Success ) {
    var version = match.Groups["version"].Value;
}

You could use PowerShell:

PS> [xml]$xml = '<meta name="generator" content="WordPress 3.4.2" />'
PS> ($xml.meta.content) -match "[\d\.]+"
True
PS> $matches[0]
3.4.2
