
Perl regex disable parenthesis extraction

I am trying something that I found in another answer, but I am running into a problem.

I know that there are better regexes for URLs, but consider this one as an example:

@links = ($content =~ m/(https?)?.*[.]com/g);
# $content holds text or HTML

The (https?)? part is there to allow links like www.google.com, but because of the parentheses the match captures "http" into $1, and that capture is what gets put into @links. That is a problem, since I want the whole link.

What would globally extract simple links (or whatever regex is specified) from text and put them into a list?
By simple, I mean:

  • http://www.google.com
  • www.google.com
  • google.com
  • https://www.google.com

Your approach is too naive; it will miss many other URLs. Instead, use Regexp::Common, like this:

use Regexp::Common qw/URI/;

my @links = ($content =~ /$RE{URI}/g);

This works for HTTP, HTTPS, FTP, etc., and it properly handles more complex URLs, such as those with query parameters.

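Use a non-capturing group, (?:...). It groups without setting $1, so in list context with /g the match returns each whole match instead of the captured piece. The non-capturing version looks like this: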

m/(?:https?)?.*[.]com/g
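To see the difference side by side, here is a minimal sketch. The sample text is made up, and the pattern is an assumption of mine: it swaps the greedy .* from the question for [\w.-]+ so that each link is matched separately instead of everything up to the last ".com" on the line.

```perl
use strict;
use warnings;

# Made-up sample text containing the four link shapes from the question.
my $content = 'Visit http://www.google.com or www.google.com, '
            . 'also google.com and https://www.google.com today';

# Capturing group: in list context with /g, the match returns only
# what the group captured (undef when the optional group did not match).
my @captured = $content =~ m{(https?://)?[\w.-]+[.]com}g;

# Non-capturing group (?:...): there are no capture groups at all,
# so each whole match is returned.
my @links = $content =~ m{(?:https?://)?[\w.-]+[.]com}g;

print join(', ', map { defined $_ ? $_ : 'undef' } @captured), "\n";
# http://, undef, undef, https://

print join(', ', @links), "\n";
# http://www.google.com, www.google.com, google.com, https://www.google.com
```

Another option is to keep the inner group and wrap the entire pattern in one outer pair of parentheses, but then both captures are returned for every match, so you would have to filter the list afterwards. The non-capturing group avoids that.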

For capturing links, I use this regex, derived from URI::Find:

m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>
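As a rough usage sketch, that pattern can be applied with /g in list context to collect every match. The sample text and variable names here are only illustrative.

```perl
use strict;
use warnings;

# Made-up sample text; note the trailing comma and full stop after the URLs.
my $text = 'Docs at https://example.com/a?b=1, mirror at http://example.org/x.';

# The pattern has no capture groups, so list context with /g
# returns each whole match.
my @urls = $text =~ m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>g;

print "$_\n" for @urls;
# https://example.com/a?b=1
# http://example.org/x
```

The second character class leaves out characters such as "," and ".", so trailing punctuation is not swallowed into the link. Note that this pattern requires the http:// or https:// prefix, so it will not pick up bare forms like www.google.com or google.com.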
