
Regular expression for recognizing url

I want to create a regex for URLs in order to extract all links from an input string. The regex should recognize the following URL formats:

  • http(s)://www.webpage.com
  • http(s)://webpage.com
  • www.webpage.com

and also more complicated URLs like: http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935

I have the following one

((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)

but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?

EDIT: It should find each link and, moreover, place a hyperlink at the corresponding position in the text, like this:

private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);

foreach (Match match in RE_URL.Matches(new_text))
{
    // Copy raw string from the last position up to the match
    if (match.Index != last_pos)
    {
        var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
        text_block.Inlines.Add(new Run(raw_text));
    }

    // Create a hyperlink for the match
    var link = new Hyperlink(new Run(match.Value))
    {
        NavigateUri = new Uri(match.Value)
    };
    link.Click += OnUrlClick;

    text_block.Inlines.Add(link);

    // Update the last matched position
    last_pos = match.Index + match.Length;
}
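The splitting logic in that loop can be sketched outside WPF. This is only a minimal illustration in Python, not the asker's code: `URL_RE` is a simplified, hypothetical pattern, `split_links` is a made-up helper name, and prefixing `http://` to a bare `www.` match is an assumption about how such a link should be made navigable. It also emits the text that follows the last match, which the loop above does not show.

```python
import re

# Simplified, hypothetical pattern: a scheme or a bare "www." prefix,
# followed by a run of typical URL characters.
URL_RE = re.compile(r"(?:https?://|www\.)[\w./#?=&%+-]+")

def split_links(text):
    """Split text into ("text", raw) and ("link", shown, href) parts."""
    parts = []
    last_pos = 0
    for m in URL_RE.finditer(text):
        # Copy the raw text between the previous match and this one.
        if m.start() != last_pos:
            parts.append(("text", text[last_pos:m.start()]))
        url = m.group(0)
        # Assumption: a match without a scheme gets "http://" prepended.
        href = url if "://" in url else "http://" + url
        parts.append(("link", url, href))
        last_pos = m.end()
    # Keep whatever follows the last match.
    if last_pos < len(text):
        parts.append(("text", text[last_pos:]))
    return parts

print(split_links("go to www.webpage.com now"))
```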

I don't know why your match result is only http://, but I cleaned up your regex a bit:

((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)

(?:...) are non-capturing groups; that means there is only one capturing group left, and it contains the complete matched string.
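To make the effect of non-capturing groups concrete, here is a small sketch in Python (an assumption on my part; the thread uses .NET, but `(?:...)` behaves the same way in both). The two patterns are shortened, hypothetical variants of the ones discussed here, kept small so the group counts are easy to read.

```python
import re

# Every pair of parentheses captures: five groups in total.
capturing = re.compile(r"((https?|ftp):((//)|(\\\\))+[\w.]*)")
# Inner groups rewritten as (?:...): only the outer group captures.
non_capturing = re.compile(r"((?:https?|ftp):(?://|\\\\)+[\w.]*)")

text = "see http://example.com now"
m1 = capturing.search(text)
m2 = non_capturing.search(text)

print(capturing.groups, m1.groups())
print(non_capturing.groups, m2.groups())
```

Both patterns match the same text; only the number of groups reported on the match differs.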

(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with a scheme from the first list followed by an optional www., or with www. alone.

[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (you were creating a character range), and unescaped the . (escaping is not needed in a character class).
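The character-range point can be checked directly. The sketch below uses Python's `re` (an assumption; range semantics inside `[...]` are the same in .NET) and two tiny classes cut out of the larger ones. With the `-` unescaped, `\+-=` is a range from `+` (0x2B) to `=` (0x3D), which happens to cover the comma, the digits and `<` as well; once the `-` is escaped, the comma has to be listed explicitly.

```python
import re

unescaped = re.compile(r"[\+-=]")   # a range: every character from '+' to '='
escaped = re.compile(r"[\+,\-=]")   # four literals: '+', ',', '-', '='

for ch in "+,-=<7":
    in_range = bool(unescaped.fullmatch(ch))
    in_literals = bool(escaped.fullmatch(ch))
    print(repr(ch), in_range, in_literals)
```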

You can see it in action on Regexr, a useful tool for testing regexes.
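The same check can be run locally. This is a rough sketch using Python's `re` rather than .NET (an assumption; the constructs in this pattern are common to both engines), with the cleaned-up pattern copied from above and a made-up sample sentence built from the formats in the question.

```python
import re

CLEANED = re.compile(
    r"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)"
    r"[\w\d:#@%/;$()~_?\+,\-=\\.&]+)"
)

sample = "a https://www.webpage.com b http://webpage.com c www.webpage.com d"
found = CLEANED.findall(sample)
print(found)

# The long example from the question, including its commas.
long_url = (
    "http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url"
    "&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e"
    "&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0"
    "&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935"
)
print(CLEANED.fullmatch(long_url) is not None)
```

Note that the character class also accepts `.` and `,`, so punctuation placed directly after a URL in running text ends up inside the match.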

But URL matching is not a simple task, please see this question here

I've just written up a blog post on recognising URLs in the most commonly used formats, such as:

  • www.google.com
  • http://www.google.com
  • mailto:somebody@google.com
  • somebody@google.com
  • www.url-with-querystring.com/?url=has-querystring

The regular expression used is

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/

However, I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression, in case you need to extend or tweak it.

The regex you give doesn't work for www. addresses because it expects a URI scheme (the prefix of the URL, like http://). The 'www.' alternative in your regular expression doesn't help because it sits in the same group as the schemes, so it would only match www.:// (which is meaningless).
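This is easy to confirm with the pattern from the question. The sketch uses Python's `re` (an assumption; the thread is about .NET, but the grouping behaves the same way): because `www\.` is just another alternative in front of the mandatory `:` and slashes, it only matches when a colon and slashes follow it.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+"
    r"[\w\d:#@%/;$()~_?\+-=\\\.&]*)"
)

for candidate in ("http://webpage.com", "www.webpage.com", "www.://webpage.com"):
    m = ORIGINAL.search(candidate)
    print(candidate, "->", m.group(0) if m else None)
```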

Try something like this instead:

((((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.))[\w\d:#@%/;$()~_?\+-=\\\.&]*)

This will match something with a valid URI scheme, or something beginning with 'www.', followed in both cases by the rest of the address.
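One detail worth watching when combining a scheme branch with a 'www.' branch is where the shared character class sits relative to the alternation. The following Python sketch uses two deliberately simplified, hypothetical patterns (not the full ones from this thread) to show the difference: if the class follows only the last alternative, the scheme branch stops right after `//`.

```python
import re

# Character class attached to the 'www.' branch only.
ungrouped = re.compile(r"((https?://)|(www\.)[\w./-]*)")
# Alternation wrapped in its own group, class applied to both branches.
grouped = re.compile(r"(((https?://)|(www\.))[\w./-]*)")

for candidate in ("http://webpage.com/a", "www.webpage.com"):
    print(
        candidate,
        "->",
        ungrouped.search(candidate).group(0),
        "|",
        grouped.search(candidate).group(0),
    )
```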
