Perl: extract text between HTML tags using regex
I'm new to Perl and I'm trying to extract the text between all <li> </li>
tags in a string and assign them to an array using regex or split/join.
For example:
my $string = "<ul>
<li>hello</li>
<li>there</li>
<li>everyone</li>
</ul>";
So that this code...
foreach my $value (@array) {
print "$value\n";
}
...results in this output:
hello
there
everyone
Note: Do not use regular expressions to parse HTML.
This first option uses HTML::TreeBuilder, one of the many HTML parsers available. Its documentation includes further examples.
use strict;
use warnings;
use HTML::TreeBuilder;
my $str
= "<ul>"
. "<li>hello</li>"
. "<li>there</li>"
. "<li>everyone</li>"
. "</ul>"
;
# Now create a new tree to parse the HTML from String $str
my $tr = HTML::TreeBuilder->new_from_content($str);
# And now find all <li> tags and create an array with the values.
my @lists =
map { $_->content_list }
$tr->find_by_tag_name('li');
# And loop through the array returning our values.
foreach my $val (@lists) {
print $val, "\n";
}
If you decide you want to use a regular expression here (I don't recommend it), you could do something like this:
my $str
= "<ul>"
. "<li>hello</li>"
. "<li>there</li>"
. "<li>everyone</li>"
. "</ul>"
;
my @matches;
while ($str =~/(?<=<li>)(.*?)(?=<\/li>)/g) {
push @matches, $1;
}
foreach my $m (@matches) {
print $m, "\n";
}
Output:
hello
there
everyone
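To see why the regex route is discouraged, here is a minimal sketch that runs the same pattern against slightly different markup. The input is hypothetical (not from the question): one <li> carries an attribute and another spans two lines, both of which are common in real pages.

```perl
use strict;
use warnings;

# Hypothetical input: the second <li> has an attribute,
# and the third item's text contains a newline.
my $str = qq{<ul><li>hello</li><li class="x">there</li><li>every\none</li></ul>};

my @matches;
# Same pattern as above: text preceded by a literal "<li>" and followed by "</li>".
while ($str =~ /(?<=<li>)(.*?)(?=<\/li>)/g) {
    push @matches, $1;
}

# Only "hello" is captured: '<li class="x">' is not the literal "<li>",
# and "." does not match a newline without the /s modifier.
print scalar(@matches), " match(es)\n";
print "$_\n" for @matches;
```

The pattern could be patched for each of these cases, but every patch handles one more shape of markup that a parser already handles for you.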
Note: Do not use regular expressions to parse HTML.
hwnd has already provided one HTML parser solution.
However, for a more modern HTML parser based on CSS selectors, you can check out Mojo::DOM. There is a very informative 8 minute intro video at Mojocast episode 5.
use strict;
use warnings;
use Mojo::DOM;
my $html = do {local $/; <DATA>};
my $dom = Mojo::DOM->new($html);
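# Recent Mojolicious releases no longer autoload ->text on a collection; map('text') does the same job.
for my $li ($dom->find('li')->map('text')->each) {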
print "$li\n";
}
__DATA__
<ul>
<li>hello</li>
<li>there</li>
<li>everyone</li>
</ul>
Outputs:
hello
there
everyone