C# Regex problems

I have this regex that I'm working on:

string addressstart = Regex.Escape("<a href=\"/url?q=");
string addressend = Regex.Escape("&amp");
string regAdd = addressstart + @"(.*?)" + addressend;

I'd like it to give me the URL from this HTML:

<a href="/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw">

So it should return "https://www.google.com/".

Any ideas why it isn't working? Thanks!

The following regex worked for me. Make sure that you select group 1, since group 0 is always the full match.

@"<a href=""/url\?q=(.*?)&amp"
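To make the group 0 versus group 1 distinction concrete, here is a minimal sketch. The class name GroupDemo and the helper CapturedUrl are made up for illustration, and the input is a shortened version of the HTML from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical helper: returns only what the (.*?) group captured.
    // On a failed match, Groups[1].Value is an empty string.
    public static string CapturedUrl(string html)
    {
        var m = Regex.Match(html, @"<a href=""/url\?q=(.*?)&amp");
        return m.Groups[1].Value;
    }

    public static void Main()
    {
        var html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U\">";
        var m = Regex.Match(html, @"<a href=""/url\?q=(.*?)&amp");

        // Group 0 (same as m.Value) is everything the pattern matched.
        Console.WriteLine(m.Groups[0].Value); // <a href="/url?q=https://www.google.com/&amp

        // Group 1 is only the part inside the parentheses.
        Console.WriteLine(CapturedUrl(html)); // https://www.google.com/
    }
}
```

Reading m.Value instead of m.Groups[1].Value is a common reason a pattern like this looks like it "isn't working".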

It appears you are looking for the Google URL as part of your string. You might find the following pattern useful, as it will match it:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}

Note that this is a small tweak of the general regex found at: What is a good regular expression to match a URL?

Edit: Please see the code below, which applies this regex and finds the value you are looking for:

string input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
var regex = new Regex(@"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}");
var output = regex.Match(input).Value; // https://www.google.com

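The problem is in the <a href="/url?q= part of the regex: the ? is not escaped, so it makes the preceding l optional. That part of the regex therefore matches <a href="/urlq= or <a href="/urq=, and neither of those contains the ? character.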
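The effect of the unescaped ? can be checked directly. The following is a small sketch; the class EscapeDemo and the helper Matches are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true if the pattern matches anywhere in the text.
    public static bool Matches(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        // Unescaped: "l?" means "an optional l", so the literal ? is never matched.
        Console.WriteLine(Matches("/url?q=", "/url?q=")); // False
        Console.WriteLine(Matches("/urlq=", "/url?q="));  // True
        Console.WriteLine(Matches("/urq=", "/url?q="));   // True

        // Escaped by hand, or via Regex.Escape: the ? is matched literally.
        Console.WriteLine(Matches("/url?q=", @"/url\?q="));               // True
        Console.WriteLine(Matches("/url?q=", Regex.Escape("/url?q=")));   // True
    }
}
```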

When parsing HTML, you should consider using an HTML parser, like HtmlAgilityPack, and only after getting the necessary node, apply the regex to the plain text.

If you want to debug your own code, here is a fix:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var s = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        var pattern = @"<a href=""/url\?q=(.*?)&amp;";
        var result = Regex.Match(s, pattern);
        if (result.Success)
            Console.WriteLine(result.Groups[1].Value);
    }
}

See a DotNetFiddle demo.

Here is an example of how you may extract all <a> href attribute values that start with /url?q= with HtmlAgilityPack. Install it via Solution > Manage NuGet Packages for Solution... and use

// Requires: using System; using System.Collections.Generic;
public List<string> HapGetHrefs(string html)
{
    var hrefs = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string of markup
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null when nothing matches
    var nodes = hap.DocumentNode.SelectNodes("//a[starts-with(@href, '/url?q=')]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            foreach (var attribute in node.Attributes)
            {
                if (attribute.Name == "href")
                {
                    hrefs.Add(attribute.Value);
                }
            }
        }
    }
    return hrefs;
}

Then all you need is to apply a simpler regex or a couple of simple string operations.
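As a rough sketch of the "simple string operations" route, the helper below takes one href value and returns the target URL. The names HrefCleaner and TargetFromHref are hypothetical. It assumes the href may still contain HTML entities such as &amp; and that the target may be percent-encoded, so it decodes both; if neither applies, the decoding steps leave the value unchanged.

```csharp
using System;
using System.Net;

public static class HrefCleaner
{
    // Hypothetical helper: extracts the value of the q parameter from an
    // href of the form /url?q=<target>&..., or returns null otherwise.
    public static string TargetFromHref(string href)
    {
        const string prefix = "/url?q=";
        if (href == null || !href.StartsWith(prefix, StringComparison.Ordinal))
            return null;

        // Turn "&amp;" back into "&" in case the attribute value is still entity-encoded.
        var decoded = WebUtility.HtmlDecode(href);

        // Drop the prefix, then cut at the first "&" (start of the next parameter).
        var rest = decoded.Substring(prefix.Length);
        var end = rest.IndexOf('&');
        var target = end >= 0 ? rest.Substring(0, end) : rest;

        // Undo percent-encoding, if any.
        return Uri.UnescapeDataString(target);
    }

    public static void Main()
    {
        var href = "/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0";
        Console.WriteLine(TargetFromHref(href)); // https://www.google.com/
    }
}
```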

You can use:

(?<=a href="\/url\?q=)[^&]+
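A short sketch of how this lookbehind pattern might be applied in C#. The class LookbehindDemo and the helper Extract are hypothetical names. Because the lookbehind is not part of the match, the whole match (m.Value) is already the URL and no capture group is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: returns the text that follows a href="/url?q=
    // up to (but not including) the first &, or null if there is no match.
    public static string Extract(string html)
    {
        var m = Regex.Match(html, @"(?<=a href=""/url\?q=)[^&]+");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        var html = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0\">";
        Console.WriteLine(Extract(html)); // https://www.google.com/
    }
}
```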
