How can I extract URLs from plain text using Perl?

Question

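I need a Perl regular expression that parses plain-text input and converts all links into valid HTML HREF links. I have tried 10 different versions that I found on the web, but none of them seems to work correctly. I also tested other solutions posted on StackOverflow, but none of those seems to work either. The correct solution should be able to find any URL in the plain-text input and convert it to: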

<a href="$1">$1</a>

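Some of the cases that the other regular expressions I tried did not handle correctly include:

A URL at the end of a line, followed by a return
URLs that include question marks
URLs that start with "https"

I'm hoping another Perl guy out there already has a regular expression for this that they are using and can share. Thanks in advance for your help!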

Answer 1

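You want URI::Find. Once you have extracted the links, you should be able to handle the rest of the problem just fine.

This is answered in perlfaq9's answer to "How do I extract URLs?", by the way. There is a lot of good stuff in the perlfaq. :)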
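
A minimal sketch of the URI::Find approach, using its callback interface. The helper names linkify_text and escape_html are made up for this illustration, and the exact form of the URL handed to the callback (for example, whether a trailing slash is added) depends on how URI::Find normalizes it, so treat the printed result as approximate:

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: return a copy of $text with every URL wrapped in <a>.
sub linkify_text {
    my ($text) = @_;
    my $finder = URI::Find->new(sub {
        # $uri is the URL as URI::Find parsed it, $orig is the text as it appeared.
        my ($uri, $orig) = @_;
        return sprintf '<a href="%s">%s</a>',
            escape_html("$uri"), escape_html($orig);
    });
    # The second argument is applied to the text between the URLs.
    $finder->find(\$text, \&escape_html);
    return $text;
}

print linkify_text(
    "Docs at https://www.example.com and http://example.org/?test=one&another=2\n"
);
```

The callback's return value replaces each URL in place, so the surrounding text only needs the escaping pass.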

Answer 2

In addition to URI::Find, also check out the big regular expression database Regexp::Common. It has a Regexp::Common::URI module that gives you something as easy as:

my ($uri) = $str =~ /$RE{URI}{-keep}/;

If you want the different parts of that URI (hostname, query parameters, etc.), see the documentation of Regexp::Common::URI::http for what is captured by the $RE{URI} regular expression.
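
As a rough sketch of what the captures look like for the HTTP pattern: with {-keep}, the first few groups are, as far as I recall from the Regexp::Common::URI::http documentation, the whole URI, the scheme, the host and the port. The helper name http_parts is hypothetical, and the group numbering should be checked against the docs of the installed version:

```perl
use strict;
use warnings;
use Regexp::Common qw/URI/;

# Hypothetical helper: return the parts of the first HTTP(S) URL in $str,
# or nothing if no URL is found.
sub http_parts {
    my ($str) = @_;
    # -scheme => 'https?' makes the pattern accept https as well as http.
    my $re = $RE{URI}{HTTP}{-scheme => 'https?'}{-keep};
    return unless $str =~ /$re/;
    return {
        uri    => $1,    # the complete URI
        scheme => $2,    # http or https
        host   => $3,    # host name or address
        port   => $4,    # assumed to be the port; undef when absent
    };
}

my $parts = http_parts('see https://example.org:8080/a?b=c for details');
print "$parts->{scheme} on $parts->{host}\n" if $parts;
```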

Answer 3

When I tried URI::Find::Schemeless with the following text:

Here is a URL  and one bare URL with 
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)

Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?

it messed up http://example.org/(9.3). So, I came up with the following with the help of Regexp::Common:

#!/usr/bin/perl

use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;

my $heuristic = URI::Find::Schemeless->schemeless_uri_re;

my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;

local $/ = '';

while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;
    $par =~ s/( $pattern ) / linkify($1) /gex;
    print "<p>$par</p>\n";
}

sub linkify {
    my ($str) = @_;
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}

This works for the input shown. Of course, life is never that easy, as you can see by trying (http://example.org/(9.3)).
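
One possible way to cope with that last case is a post-processing step that peels off trailing characters more likely to belong to the surrounding text. The sketch below is a heuristic of my own, not part of either module: split_trailing is a hypothetical helper, and the balanced-parenthesis rule is an assumption that will misjudge some URLs.

```perl
use strict;
use warnings;

# Hypothetical helper: split a matched candidate into the URL proper and any
# trailing characters that probably belong to the surrounding prose.
sub split_trailing {
    my ($candidate) = @_;
    my $tail = '';
    while (length $candidate) {
        my $last = substr $candidate, -1;
        if ($last =~ /[.,;:!?'"]/) {
            # Sentence punctuation at the very end: treat it as prose.
        }
        elsif ($last eq ')') {
            # Keep a closing parenthesis only if it has a partner in the URL.
            my $open  = () = $candidate =~ /\(/g;
            my $close = () = $candidate =~ /\)/g;
            last if $close <= $open;
        }
        else {
            last;
        }
        $tail = $last . $tail;
        chop $candidate;
    }
    return ($candidate, $tail);
}

my ($url, $rest) = split_trailing('http://example.org/(9.3))');
print "url=$url rest=$rest\n";
```

A linkify routine could call this on each match, link only the first part, and print the second part after the closing </a>.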

Answer 4

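Here is some sample code showing how to extract a URL. It reads lines from stdin, checks whether each input line contains a valid URL format, and prints the URL it finds.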

use strict;
use warnings;

use Regexp::Common qw /URI/;

while (1)
{
        #getting the input from stdin.
        print "Enter the line: \n";
        my $line = <>;
        last unless defined $line; #stop at end of input instead of looping forever
        chomp ($line); #removing the unwanted new line character
        my ($uri)= $line =~ /$RE{URI}{HTTP}{-keep}/       and  print "Contains an HTTP URI.\n";
        print "URL : $uri\n" if ($uri);
}

The sample output I got is as follows:

Enter the line:
http://stackoverflow.com/posts/2565350/
Contains an HTTP URI.
URL : http://stackoverflow.com/posts/2565350/
Enter the line:
this is not valid url line
Enter the line:
www.google.com
Enter the line:
http://
Enter the line:
http://www.google.com
Contains an HTTP URI.
URL : http://www.google.com

4 answers

Answer 1
10 2010-04-02 01:56:48

Answer 2
4 2010-04-02 04:06:40

Answer 3
2 accepted 2010-04-02 06:10:04

Answer 4
1 2010-04-02 06:36:12
