
Non-greedy regex matching in sed/perl

I was running sed /http.*.torrent/s/.*(http.*.torrent).*/\\1/;/http.*.torrent/p 1.html to extract links. However, sed lacks a non-greedy quantifier, which I need because 'torrent' appears again further along the line, so I tried to convert the command to Perl. I need help with the Perl version, though. (Or if you know how to do it with sed, say so.)

perl -ne s/.*(http.*?.torrent).*/\\1/ 1.html

Now I need to add this part, after converting it from sed: /http.*.torrent/p

This was a part of sed /http.*.torrent/s/.*(http.*.torrent).*/\\1/;/http.*.torrent/p 1.html

but this didn't work either; sed started but didn't quit, and as I pressed keys they were echoed and nothing else happened.

I recommend letting a well-proven module such as HTML::LinkExtor do the heavy lifting for you, and using a regexp simply to validate the links that it finds. The example below shows how easy it can be.

use Modern::Perl;
use HTML::LinkExtor;
use Data::Dumper;

my @links;


# A callback for LinkExtor. Disqualifies non-conforming links, and pushes
# into @links any conforming links.

sub callback {
    my ( $tag, %attr ) = @_;
    return if $tag ne 'a';
    # Skip anchors without an href (e.g. <a name="...">) to avoid an undef warning.
    return unless defined $attr{href} && $attr{href} =~ m{https?://[^/]*torrent}i;
    push @links, \%attr;
}


# The work is done here: Read the html file, parse it, and move on.
local $/;    # slurp mode: read the whole DATA section in one go
my $html = <DATA>;
my $p = HTML::LinkExtor->new(\&callback);
$p->parse( $html );

print Dumper \@links;

__DATA__
<a href="https://toPB.torrent" title="Download this torrent">The goal</a>
<a href="http://this.is.my.torrent.com" title="testlink">Testing2</a> <a href="http://another.torrent.org" title="bwahaha">Two links on one line</a>
<a href="https://toPBJ.torrent.biz" title="Last test">Final Test</a>
A line of nothingness...
That's all folks.

HTML::LinkExtor lets you set up a callback function. The module itself parses your HTML document to find any links. You are looking for the 'a' links (as opposed to 'img', etc.), so the callback returns as early as possible unless it has an 'a' link. It then tests that link to see whether 'torrent' appears in it, in an appropriate position. If that particular regexp isn't what you need, you'll have to be more specific, but I think it's what you were after. As links are found they're pushed onto a data structure. At the end of my test script I print the structure so you can see what you have.

The __DATA__ section contains some sample HTML snippets, along with junk text to verify that it's only finding links.

Using a well-tested module to parse your HTML is much more durable than constructing fragile regular expressions to do the whole job. Many well-made parsing solutions use regular expressions under the hood, but only for small pieces of the work here and there. When you start relying on a regexp to do the parsing itself (as opposed to identifying small building blocks), you run out of gas quickly.

Have fun.

sed doesn't have non-greedy matching, so your best bet is just to use perl:

perl -ne '/(http.*?\.torrent)/ && print "$1\n"' 1.html

The -n argument tells perl to read each line of input (from 1.html in this case, or from stdin if no files are given on the command line) and run something against each line. The -e argument supplies that "something to execute" on the command line.

The first part of the expression matches each line against the pattern you were looking for, with the parentheses capturing the interesting part into $1. If the match succeeds, it evaluates to true, and the print is then executed (giving you your match along with a newline).
