Extracting the text between HTML tags in Perl using a regular expression

Question

I'm new to Perl and I'm trying to extract the text between all <li> </li> tags in a string and assign the results to an array, using a regex or split/join.

For example:

my $string = "<ul>
                  <li>hello</li>
                  <li>there</li>
                  <li>everyone</li>
              </ul>";

So that this code...

foreach my $value (@array) {
    print "$value\n";
}

...results in this output:

hello
there
everyone

Answer 1

Note: Do not use regular expressions to parse HTML.

This first option uses HTML::TreeBuilder, one of many HTML parsers available. Its documentation includes further examples.

use strict;
use warnings;
use HTML::TreeBuilder;

my $str 
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

# Now create a new tree to parse the HTML from String $str
my $tr = HTML::TreeBuilder->new_from_content($str);

# And now find all <li> tags and create an array with the values.
my @lists = 
      map { $_->content_list } 
      $tr->find_by_tag_name('li');

# And loop through the array, printing each value.
foreach my $val (@lists) {
   print $val, "\n";
}

If you decide you want to use a regular expression here (I don't recommend it), you could do something like this:

my $str
   = "<ul>"
   . "<li>hello</li>"
   . "<li>there</li>"
   . "<li>everyone</li>"
   . "</ul>"
   ;

my @matches;
while ($str =~/(?<=<li>)(.*?)(?=<\/li>)/g) {
  push @matches, $1;
}

foreach my $m (@matches) {
   print $m, "\n";
}

Output:

hello
there
everyone
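
To illustrate why the regular expression above is fragile, here is a minimal sketch using core Perl only. The sample markup and the helper name extract_li are assumptions made for this illustration and are not part of the original answer. The lookaround pattern only matches a bare <li> whose closing tag is on the same line, so it misses items with attributes or line breaks. A slightly more tolerant pattern handles those two cases, but it would still fail on nested lists, comments, or malformed markup, which is why a parser remains the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the text of
# each <li> element, tolerating attributes and line breaks inside the item.
sub extract_li {
    my ($html) = @_;
    # <li\b[^>]*> matches "<li>" as well as '<li class="x">';
    # /s lets . match newlines, /i ignores tag case, /g collects every match.
    my @items = $html =~ m{<li\b[^>]*>(.*?)</li>}gsi;
    # Trim the whitespace left over from indentation.
    s/^\s+|\s+$//g for @items;
    return @items;
}

my $html = <<'HTML';
<ul>
  <li>hello</li>
  <li class="greeting">there</li>
  <li>
    everyone
  </li>
</ul>
HTML

# The lookaround pattern from the answer only finds the plain <li>hello</li>.
my @strict = $html =~ /(?<=<li>)(.*?)(?=<\/li>)/g;
print scalar(@strict), " match(es) with the original pattern\n";

print "$_\n" for extract_li($html);
```

With this sample input, the original pattern reports a single match, while the helper prints hello, there and everyone on separate lines.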

Answer 2

Note: Do not use regular expressions to parse HTML.

hwnd has already provided one HTML parser solution.

However, for a more modern HTML parser based on CSS selectors, you can check out Mojo::DOM. There is a very informative 8-minute intro video at Mojocast episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $html = do {local $/; <DATA>};

my $dom = Mojo::DOM->new($html);

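# map('text') calls ->text on each matched element; older Mojolicious
# releases also accepted ->text directly on the collection.
for my $li ($dom->find('li')->map('text')->each) {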
    print "$li\n";
}

__DATA__
<ul>
  <li>hello</li>
  <li>there</li>
  <li>everyone</li>
</ul>

Output:

hello
there
everyone

2 answers

Answer 1
6 2013-09-24 01:19:55

Answer 2
1 2014-06-15 17:12:37
