Character escaping in regular expression

I used this regex to match a hyperlink that contains a specific word in the href:

<a( .*?)? href=\".*?" + word + ".*?\"( .*?)?>.*?</a>

This returns the first occurrence of a matching link.

Now I need to find all hyperlinks that match in the same way, so I tried this regex:

/<a [^>]*\bhref\s*=\s*"[^"]*word.*?<\/a>/

I'm having trouble getting the compiler to accept this expression. The problem seems to be escaping some special characters, and this part in particular:

"[^"]

I tried escaping the [ with \\, and putting @ in front of double quotes, but no luck.

The error reads "bad compile constant value".

Does anyone know how to format this regex to satisfy the compiler?

Regex is not a good way to parse HTML files.

You should use HtmlAgilityPack instead:

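using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");

// SelectNodes returns null when nothing matches, so guard against that.
HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");

List<string> hrefLst = anchors == null
    ? new List<string>()
    : anchors.Select(x => x.Attributes["href"].Value)
             .Where(y => y.Contains(word))
             .ToList();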

hrefLst now holds the href of every link that contains the word.

Isn't that simple?

Although you can escape everything that needs to be escaped in a regular string literal, regular expressions are far easier to read when the string is @-quoted (a verbatim string). The only thing you then need to worry about is double quotes, each of which must be doubled.

string expression = @"/<a [^>]*\bhref\s*=\s*""[^""]*word.*?<\/a>/";

Note: As the comments say, this regex might fail. I haven't tested it, I just modified it to make it compile.
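For what it's worth, here is a minimal sketch of how the verbatim-string approach could be used to collect all matching links with Regex.Matches. It is not the original poster's code: FindLinks and the sample HTML are hypothetical names made up for illustration, the pattern is a tightened variation of the one above, and the leading and trailing / delimiters are dropped on the assumption that .NET would otherwise treat them as literal slashes. It still shares the usual weaknesses of parsing HTML with regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): returns every
// <a ...>...</a> element whose double-quoted href contains `word`.
static List<string> FindLinks(string html, string word)
{
    // Verbatim strings keep backslashes literal; "" stands for one double quote.
    // Regex.Escape stops characters in `word` from being read as regex syntax.
    string pattern = @"<a\s[^>]*\bhref\s*=\s*""[^""]*"
                   + Regex.Escape(word)
                   + @"[^""]*""[^>]*>.*?</a>";

    var links = new List<string>();
    // Regex.Matches returns all matches, not just the first one.
    foreach (Match m in Regex.Matches(html, pattern,
                 RegexOptions.IgnoreCase | RegexOptions.Singleline))
    {
        links.Add(m.Value);
    }
    return links;
}

// The same fragment written both ways produces an identical string.
string verbatim = @"""[^""]*";
string regular = "\"[^\"]*";
Console.WriteLine(verbatim == regular); // True

// Made-up sample input: the first and third links contain "foo" in the href.
string sample = "<a href=\"http://a.com/foo\">A</a> "
              + "<a class=\"x\" href=\"http://b.com/bar\">B</a> "
              + "<a href=\"/foo/2\">C</a>";
foreach (string link in FindLinks(sample, "foo"))
{
    Console.WriteLine(link);
}
```

Using [^""]* and [^>]* rather than .*? inside the tag keeps a match from running past the end of one link into the next, which is the main reason the pattern differs from the one in the question.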
