Regex match and substring in one?

I have an HTML source as input and would like to know which CMS the website was built with. Many CMSs leave their name in a meta tag like this:

<meta name="Generator" content="MY CMS" />   

I can get the result like this:

        Match match = Regex.Match(html, ".*(?i)meta.*generator.*");
        match = Regex.Match(match.ToString(), "content.*\".*\"");
        match = Regex.Match(match.ToString(), "\".*\"");

This gives me "MY CMS".

But is there any way to shorten it to a single Regex.Match?

Please note that the meta tag could also look like this:

<meta content="MY CMS" name="Generator" />

Thanks and best regards

var regex = new Regex(@"<meta\s+name=""Generator""\s+content=""([^""]+)""", RegexOptions.IgnoreCase);
var match = regex.Match(html);
var generator = match.Groups[1].Value;

Try the following:

// Inside a verbatim string (@"..."), a literal double quote must be written as "".
Regex regex = new Regex(@"<meta[^>]+content\s*=\s*['""]([^'""]+)['""][^>]*>");
Match match = regex.Match(input);

The value is in group 1.

Hope it helps.
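Since the question asks for a single Regex.Match that copes with both attribute orders, here is a minimal sketch of one way to do it. It assumes the whole tag contains no ">" inside an attribute value, and the class and method names (GeneratorFinder, Find) are made up for illustration. A lookahead checks that name="generator" appears somewhere in the tag, and the named group "value" then captures the content attribute wherever it sits.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class GeneratorFinder
{
    // (?=...) only asserts that name="generator" occurs inside this tag,
    // so the content attribute may come before or after it.
    // (?<q>["']) remembers the opening quote; \k<q> requires the same one to close.
    private static readonly Regex MetaGenerator = new Regex(
        @"<meta\b(?=[^>]*\bname\s*=\s*[""']generator[""'])[^>]*\bcontent\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    // Returns the content value, or null when no generator meta tag is found.
    public static string Find(string html)
    {
        Match match = MetaGenerator.Match(html);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

Calling GeneratorFinder.Find(html) on either of the two tags from the question should yield MY CMS without the surrounding quotes. It is still a regex over HTML, so unusual markup (unquoted attribute values, for instance) will slip through.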

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.

Regex is meant for regular expressions, NOT irregular ones.

You can use this code to retrieve the value with HtmlAgilityPack:

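// Requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectSingleNode returns null when no tag matches, so guard with ?.
// Note: the XPath comparison with 'Generator' is case-sensitive.
var content = doc.DocumentNode
                 .SelectSingleNode("//meta[@name='Generator']")
                 ?.Attributes["content"]?.Value;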
