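Perl: Extract domain name
Yet another request to parse URLs, but most of the examples I have found are incomplete or purely theoretical. Surely there is something that works in Perl.
I have the following URLs: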
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS
I am trying to extract only the domain name. For example:
linksyssmartwifi.com
amazon.com
github.com
I have tried both Perl and Vim but could not get it done. My best approximation is the following:
perl -pe 's!(^https?\://.*[\.](.+\..+?)/.*$)!$1 -- [$2] !g' all_urls_sorted.txt
Some of the URLs are parsed correctly (see the []), while others are not:
https://sites.google.com/site/steveyegge2/singleton-considered-stupid -- [google.com]
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/ -- [heroku.com]
https://training.github.com/ -- [github.com]
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/ -- [whatsapp.com]
https://wiki.haskell.org/GHC -- [haskell.org]
As my tests show, URLs in which the domain starts directly after the // (in https?://) are left out.
I would be very glad if you know how to solve this.
Thanks
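The misses follow from the `.*[\.]` part of the substitution: it requires at least one dot before the captured group, so a host with no subdomain (such as vimeo.com) can never match. A minimal sketch of one way around that, making the subdomain prefix optional; the helper name `last_two_labels` is made up for illustration and is not part of the original one-liner:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the last two labels of the host in a URL, or undef if nothing matches.
sub last_two_labels {
    my ($url) = @_;
    # (?:[^/.]+\.)* consumes zero or more leading subdomain labels, so a host
    # without a subdomain matches as well as one with several.
    return $url =~ m{^https?://(?:[^/.]+\.)*([^/.]+\.[^/.]+)(?:/|$)} ? $1 : undef;
}

for my $url (qw(https://vimeo.com/54505525 https://web.whatsapp.com/ https://stripe.com/be)) {
    my $domain = last_two_labels($url);
    print "$url -- [", (defined $domain ? $domain : '?'), "]\n";
}
```

This still keeps only the last two labels of the host, so it inherits the limits of that rule.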
Use the URI module:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use URI;

while (<DATA>) {
    chomp;
    my $uri  = URI->new($_);
    my $host = $uri->host;
    # Keep only the last two dot-separated labels of the host.
    my ($domain) = $host =~ m/([^.]+\.[^.]+$)/;
    say $domain;
}
__DATA__
https://vimdoc.sourceforge.net/htmldoc/pattern.html
http://linksyssmartwifi.com/ui/1.0.1.1001/dynamic/login.html
http://www.catonmat.net/download/perl1line.txt
https://github.com/robbyrussell/oh-my-zsh/wiki/Cheatsheet
https://drive.google.com/drive/u/0/folders/0B5jNDUmF2eUJuSnM
http://www.gnu.org/software/coreutils/manual/coreutils.html
http://www.catonmat.net/download/perl1line.txt
https://feedly.com/i/my
http://vimhelp.appspot.com/
https://git-scm.com/doc
https://read.amazon.com/
https://github.com/netsamir/following
https://scotch.io/
https://servicios.dgi.gub.uy/
https://sourcemaking.com/
https://stackedit.io/editor
https://stripe.com/be
https://toolbelt.heroku.com/
https://training.github.com/
https://vimeo.com/54505525
https://vimeo.com/tag:drew+neil
https://web.whatsapp.com/
https://www.ctan.org/
https://www.eff.org/
https://www.mybeluga.com/
https://www.solveforx.com/
https://www.symynd.com/
https://www.symynd.com/#
https://www.tizen.org/
http://workforall.net/CDS-Credit-default-Swaps.html#Credit_Default_Swaps_CDS
Output:
sourceforge.net
linksyssmartwifi.com
catonmat.net
github.com
google.com
gnu.org
catonmat.net
feedly.com
appspot.com
git-scm.com
amazon.com
github.com
scotch.io
gub.uy
sourcemaking.com
stackedit.io
stripe.com
heroku.com
github.com
vimeo.com
vimeo.com
whatsapp.com
ctan.org
eff.org
mybeluga.com
solveforx.com
symynd.com
symynd.com
tizen.org
workforall.net
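One line of that output deserves a caveat: for servicios.dgi.gub.uy the two-label rule yields gub.uy, which looks like a registry-level suffix rather than a registered domain. Below is a rough sketch of how a suffix table could handle such hosts. The `%MULTI_LABEL_SUFFIX` table and the `registrable_domain` helper are hypothetical names with a tiny illustrative list; a real program would presumably load the full Public Suffix List instead. It takes a host name, so it uses core Perl only:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical, hand-maintained set of two-label suffixes (illustrative only).
my %MULTI_LABEL_SUFFIX = map { $_ => 1 } qw(gub.uy com.uy co.uk);

# Return the suffix plus one more label, e.g. "dgi.gub.uy" or "google.com".
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return undef if @labels < 2;
    # Keep three labels when the last two form a known suffix, otherwise two.
    my $keep = (@labels >= 3 && $MULTI_LABEL_SUFFIX{"$labels[-2].$labels[-1]"}) ? 3 : 2;
    return join '.', @labels[-$keep .. -1];
}

print registrable_domain($_), "\n" for qw(servicios.dgi.gub.uy drive.google.com);
```

In the loop above, the value of `$uri->host` could be passed to such a helper in place of the two-label regex.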
My best approximation uses URI::URL:
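use URI::URL;

# @filecontents holds the URLs read from the input file.
foreach my $uri (@filecontents) {
    my $uriobj = URI::URL->new($uri);
    my $host   = $uriobj->host;
    my @parts  = split /\./, $host;
    # Join the last two labels of the host with a dot.
    print "$uri -- $parts[-2].$parts[-1]\n";
}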
Hope that helps.
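A regex solution is:
//(?:[^./]+[.])*([^/.]+[.][^/.]+)/
If the trailing slash is optional, just add a ?:
//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?
This should be used with the global modifier and a delimiter other than /.
Essentially, it looks at what lies between the // and the next /. Any additional subdomains are consumed by (?:[^./]+[.])*, and the main domain ends up in the capture group ([^/.]+[.][^/.]+).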
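For illustration, here is a small sketch of how that pattern might be applied in Perl with `m{...}`-style delimiters and the `/g` modifier. The variable names and the sample input are chosen for this example only:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The pattern from above, compiled with qr{...} so the slashes need no escaping.
my $DOMAIN_RE = qr{//(?:[^./]+[.])*([^/.]+[.][^/.]+)/?};

# One URL per line, as in the input file.
my $text = join "\n", qw(
    https://stripe.com/be
    https://toolbelt.heroku.com/
    https://www.symynd.com/#
);

# In list context, a /g match returns capture group 1 of every match.
my @domains = $text =~ /$DOMAIN_RE/g;
print "$_\n" for @domains;
```

Note that the character classes do not exclude whitespace, so a URL without a slash after the host, followed by other text containing dots, may over-match. Matching one URL at a time avoids that.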