
C# regex problems

I have this regex that I'm working on:

string addressstart = Regex.Escape("<a href=\"/url?q=");
string addressend = Regex.Escape("&amp");
string regAdd = addressstart + @"(.*?)" + addressend;

I'd like it to give me the URL from this HTML:

<a href="/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw">

so it should return "https://www.google.com/".

Any ideas why it isn't working? Thanks!

The following regex worked for me. Make sure that you select group 1, since group 0 is always the full match. Note that inside a C# verbatim string a double quote is written as "", not \".

@"<a href=""/url\?q=(.*?)&amp"
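To make the difference between the two groups concrete, here is a minimal sketch. The names GroupDemo and Extract are made up for illustration, and the usage below assumes a shortened version of the HTML from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // In a verbatim string, a double quote is written as "" (not \").
    private const string Pattern = @"<a href=""/url\?q=(.*?)&amp";

    // Returns the text of the requested group,
    // or an empty string when the pattern does not match.
    public static string Extract(string html, int group)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[group].Value : string.Empty;
    }
}
```

For the input <a href="/url?q=https://www.google.com/&amp;sa=U">, Extract(html, 0) gives the whole matched text, including the <a href="/url?q= prefix and the trailing &amp, while Extract(html, 1) gives only https://www.google.com/.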

It appears you are looking for the Google URL inside your string. You might find the following pattern useful, since it will match it:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}

Note that this is a small tweak of the general regex from "What is a good regular expression to match a URL?".

Edit: See the code below for how to apply this regex and find the value you are looking for:

string input = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
var regex = new Regex(@"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}");
var output = regex.Match(input).Value; // https://www.google.com

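The problem is with the "<a href=\"/url?q=" part of the regex: the ? is not escaped, so it means an optional l. That part of the regex therefore matches <a href="/urlq= or <a href="/urq=, and neither of those contains the ? character.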
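The effect of the bare ? can be checked with a small sketch like the one below. The class and member names (EscapeDemo, Unescaped, Escaped, Matches) are hypothetical. Keep in mind that this only affects patterns written by hand: Regex.Escape already turns ? into \?.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hand-written pattern with a bare '?': it makes the preceding 'l'
    // optional instead of matching a literal question mark.
    public const string Unescaped = "<a href=\"/url?q=(.*?)&amp";

    // Same pattern with the question mark escaped.
    public const string Escaped = "<a href=\"/url\\?q=(.*?)&amp";

    // True when the pattern matches somewhere in the input.
    public static bool Matches(string html, string pattern)
    {
        return Regex.IsMatch(html, pattern);
    }
}
```

With the HTML from the question, Matches(html, EscapeDemo.Unescaped) is false and Matches(html, EscapeDemo.Escaped) is true. Regex.Escape("/url?q=") produces /url\?q=, which is the escaped form.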

When parsing HTML, you should consider using some HTML parser, like HtmlAgilityPack, and only after getting the necessary node, apply the regex on the plain text.

If you want to debug your own code, here is a fix:

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var s = "<a href=\"/url?q=https://www.google.com/&amp;sa=U&amp;ved=0ahUKEwizwPy0yNHSAhXMDpAKHec7DAsQFgh6MA0&amp;usg=AFQjCNEjJILXPMMCNAlz5MN1IIzjpr79tw\">";
        var pattern = @"<a href=""/url\?q=(.*?)&amp;";
        var result = Regex.Match(s, pattern);
        if (result.Success)
            Console.WriteLine(result.Groups[1].Value);
    }
}

See a DotNetFiddle demo.

Here is an example of how you may extract all <a> href attribute values that start with /url?q= with HtmlAgilityPack. Install it via Solution > Manage NuGet Packages for Solution... and use

// Requires the HtmlAgilityPack NuGet package, plus
// using System; and using System.Collections.Generic;
public List<string> HapGetHrefs(string html)
{
    var hrefs = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = hap.DocumentNode.SelectNodes("//a[starts-with(@href, '/url?q=')]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            foreach (var attribute in node.Attributes)
            {
                if (attribute.Name == "href")
                {
                    hrefs.Add(attribute.Value);
                }
            }
        }
    }
    return hrefs;
}

Then all you need to do is apply a simpler regex or a couple of simple string operations.
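As a rough sketch of the "simple string operations" route, something like the following could be applied to each href value. HrefCleaner and ExtractTarget are made-up names, and the code assumes the target URL itself contains no unencoded & character.

```csharp
using System;

public static class HrefCleaner
{
    private const string Prefix = "/url?q=";

    // Strips the "/url?q=" prefix and everything from the first '&' onward.
    // Returns the input unchanged when it does not start with the prefix.
    public static string ExtractTarget(string href)
    {
        if (!href.StartsWith(Prefix, StringComparison.Ordinal))
            return href;

        string rest = href.Substring(Prefix.Length);
        int amp = rest.IndexOf('&');
        return amp >= 0 ? rest.Substring(0, amp) : rest;
    }
}
```

Called with the href from the question, ExtractTarget returns https://www.google.com/. Cutting at the first & works whether the separator appears as &amp; or as a plain &.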

You can use:

(?<=a href="\/url\?q=)[^&]+
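A minimal sketch of using this lookbehind pattern from C# might look like this. LookbehindDemo and FirstUrl are hypothetical names, and the pattern is written as a verbatim string, so the double quote is doubled and the forward slashes are left unescaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // (?<=...) asserts that the prefix precedes the match without consuming it,
    // so Match.Value is the URL itself and no capture group is needed.
    private const string Pattern = @"(?<=a href=""/url\?q=)[^&]+";

    // Returns the first URL found, or an empty string when nothing matches.
    public static string FirstUrl(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Because [^&]+ stops at the first &, the result for the HTML in the question is https://www.google.com/, read from Match.Value rather than from group 1.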
