I am trying something that I found in another answer, but I am having some problems.
I know that there are better regexes for URLs, but consider this one as an example:
@links=($content =~ m/(https?)?.*[.]com/g);
($content holds text or HTML.)
The part (https?)? is optional so that links like www.google.com also match, but because of the parentheses it captures "http" into $1, and that is what gets put into @links! That is a problem, since I want the whole link.
What would globally extract simple links (or whatever regex is specified) from text and put them into a list?
By simple, I mean:
http://www.google.com
www.google.com
google.com
https://www.google.com
Your approach is too naive; it won't catch many other URLs. Instead, use Regexp::Common, like this:
use Regexp::Common qw/URI/;
my @links = ($content =~ /$RE{URI}/g);
This works for HTTP, HTTPS, FTP, etc., and properly captures more advanced combinations of URL parameters. Note that these patterns expect a scheme, so a bare www.google.com or google.com is not matched.
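A non-capturing version looks like this. With no capturing groups in the pattern, m//g in list context returns each whole match instead of $1: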
m/(?:https?)?.*[.]com/g
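To make the difference concrete, here is a minimal sketch. The sample $content string and the tightened pattern at the end are assumptions for illustration, not part of the question. It shows that the capturing version returns only "http", that the non-capturing version returns the whole match but the greedy .* runs to the last ".com" on the line, and that a more restrictive pattern picks out the four simple links separately.

```perl
use strict;
use warnings;

# Hypothetical sample text; the real $content could be any text or HTML.
my $content = 'http://www.google.com and www.google.com, or google.com; also https://www.google.com today.';

# Capturing group: list-context m//g returns the captures ($1), not the whole match.
my @captured = ($content =~ m/(https?)?.*[.]com/g);

# Non-capturing group: each element is a whole match,
# but the greedy .* swallows everything up to the last ".com".
my @greedy = ($content =~ m/(?:https?)?.*[.]com/g);

# Assumed tightening: optional scheme, then dot-separated host labels ending in "com".
my @links = ($content =~ m{(?:https?://)?(?:[\w-]+\.)+com\b}g);

print "captured: @captured\n";
print "greedy:   @greedy\n";
print "link:     $_\n" for @links;
```

The tightened pattern only covers .com host names without paths or query strings, which is what "simple links" means in the question; anything broader needs a fuller URL pattern.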
For capturing links, I use this regex, derived from URI::Find:
m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>
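As a rough usage sketch, the pattern can be compiled once with qr and applied with /g in list context. The sample text is made up for illustration. Two things are worth noticing: the pattern requires an http:// or https:// prefix, so a bare www.google.com is not matched, and its final character class leaves out characters such as "," "." and ")", so trailing punctuation is not included in the link.

```perl
use strict;
use warnings;

# The pattern from above, compiled once with qr (same <...> delimiters).
my $url_re = qr<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>;

# Hypothetical sample text, not from the original post.
my $content = 'Visit https://www.google.com/search?q=perl, or (http://example.com/a.html). Bare www.google.com is skipped.';

# No capturing groups, so list-context /g yields the whole matches.
my @links = ($content =~ /$url_re/g);

print "$_\n" for @links;
```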