Perl regex disable parenthesis extraction

Question

I am trying something that I found in another answer, but I am running into a problem.

I know that there are better regexes for URLs, but consider this one as an example:

@links=($content =~ m/(https?)?.*[.]com/g);
* $content contains plain text or HTML

The (https?)? part is optional so that links like www.google.com also match, but because of the parentheses the match returns "http" as $1, and that is what ends up in @links. That is a problem, since I want the whole link.

What would globally extract simple links (or whatever regex is specified) from text and put them into a list?
By simple, I mean:

http://www.google.com
www.google.com
google.com
https://www.google.com

Answer 1

Your approach is too naive and will miss many other URLs. Instead, use Regexp::Common, like this:

use Regexp::Common qw/URI/;

my @links = ($content =~ /$RE{URI}/g);

This works for HTTP, HTTPS, FTP, etc., and properly captures more advanced combinations, such as URL parameters.

Answer 2

The non-capturing version looks like this:

m/(?:https?)?.*[.]com/g
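
To see the difference between the two group types, here is a minimal sketch. The sample text in $content is made up for illustration, and \S* is swapped in for .* as an assumption on my part, because a greedy .* would run from the first possible start to the last ".com" on the line and return one long match.

```perl
use strict;
use warnings;

# Hypothetical sample text, not from the question.
my $content = "See http://www.google.com and www.example.com for details.";

# Capturing group: in list context, m//g returns what the group captured
# for each match (undef when the optional group did not take part).
my @captured = ($content =~ m/(https?)?\S*[.]com/g);

# Non-capturing group: with no capture groups in the pattern,
# m//g in list context returns each whole match instead.
my @whole = ($content =~ m/(?:https?)?\S*[.]com/g);

print join(", ", map { defined $_ ? $_ : "undef" } @captured), "\n";
print join(", ", @whole), "\n";
```

With the capturing group, @links would receive "http" and undef; with (?:...) it receives the full links.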

For capturing links, I use this regex, derived from URI::Find:

m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>
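
As a rough usage sketch, the pattern above can be applied with /g in list context to collect every match. The sample text is hypothetical. The second character class is there so that a trailing comma or period is not swallowed into the link.

```perl
use strict;
use warnings;

# Hypothetical sample text with punctuation right after each URL.
my $content = "Docs at https://www.example.com/path?x=1, and http://example.org.";

# Same pattern as above; no capture groups, so list-context m//g
# returns the full text of every match.
my @links = ($content =~ m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>g);

print "$_\n" for @links;
```

Note that this pattern requires an explicit http:// or https:// scheme, so bare forms such as www.google.com or google.com will not match it.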

2 answers

solution1
5 2012-10-29 00:57:20

solution2
3 ACCEPTED 2012-10-29 02:45:50
