Perl regex to extract a value from nested HTML tags

Question

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if ($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/) {
    $title = $1;
} else {
    $title = "";
}
print "$title";

Output: Google</b></h1>

Expected: Google

I can't extract the link text with a regex in Perl. The text may sit inside one or more levels of nested tags:

<h1><b><i>Google</i></b></h1>

Please try it with these inputs:

1) <td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>

2) <a href="http://www.hp.com"><h1><b>HP</b></h1></a>

3) <a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);

4) <a href="#cite_note-1"><span>[</span>1<span>]</span></a>

Output:

Unix shell

HP

generic

[1]

Answer 1

As the comments say, don't use a regex for this. I particularly like the Mojo suite, which lets me use CSS selectors:

use Mojo::DOM;

my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->at('a[href="#google"]')->all_text, "\n";

Or use HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;

my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->findvalue('//a[@href="#google"]'), "\n";

Answer 2

Try this:

if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)

That should capture everything after the href and between the <b>...</b> tags.

To get everything after the last > and before the first </ instead, you can use

<a.*?href.*?>([^>]*?)<\/
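To make the difference between the two patterns concrete, here is a minimal sketch that runs both against two of the question's inputs. The helper name `first_capture` and the variable names are made up for illustration; only the two patterns come from this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return capture group 1, or '' when the pattern does not match.
sub first_capture {
    my ($html, $re) = @_;
    return $html =~ $re ? $1 : '';
}

# The two patterns from this answer.
my $b_only  = qr/<a.*?href.*?><b>(.*?)<\/b>/;   # requires a <b> element
my $general = qr/<a.*?href.*?>([^>]*?)<\//;     # text just before the first closing tag

my @inputs = (
    q(<a href="#google"><h1><b>Google</b></h1></a>),
    q(<a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>),
);

for my $html (@inputs) {
    printf "b-only: '%s'  general: '%s'\n",
        first_capture($html, $b_only),
        first_capture($html, $general);
}
```

The first pattern finds nothing in the second input because that link has no <b> element, while the second pattern returns the text in both cases. Neither pattern collects text that is split across several sibling tags, as in the [1] example from the question.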

Answer 3

~~For this simple case, you could use:~~ The requirements are no longer simple; see @amon's answer on how to use an HTML parser.

/<a.*?>([^<]+)</

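This matches the opening a tag, then skips ahead until it finds some text between a > and a <.

Still, as others have mentioned, you should generally use an HTML parser.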

echo '<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>
<a href="http://www.hp.com"><h1><b>HP</b></h1></a>
<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
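The fourth input from the question shows why the requirements stopped being simple: the link text is split across sibling <span> elements, so the pattern above stops at the first fragment. The sketch below contrasts it with one possible regex-only workaround, which captures the whole anchor body and then deletes every tag inside it. The helper names `first_text` and `anchor_text` are hypothetical, the workaround is not part of this answer, and it assumes no > appears inside attribute values or comments, which is exactly the kind of case an HTML parser handles properly.

```perl
use strict;
use warnings;

# Pattern from this answer: the first run of text after the opening <a ...> tag.
sub first_text {
    my ($html) = @_;
    return $html =~ /<a.*?>([^<]+)</ ? $1 : '';
}

# Illustrative alternative (an assumption, not from the answer):
# grab everything between <a ...> and </a>, then strip the nested tags.
sub anchor_text {
    my ($html) = @_;
    return '' unless $html =~ /<a\b[^>]*>(.*?)<\/a>/s;
    (my $text = $1) =~ s/<[^>]+>//g;   # copy the capture, then remove tags from the copy
    return $text;
}

my $cite = q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>);
print first_text($cite),  "\n";   # only the fragment before the first nested closing tag
print anchor_text($cite), "\n";   # all text fragments inside the anchor, joined
```

For this input the first call yields only the opening bracket, while the second yields [1].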

Answer 4

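I came up with this regex, which works for all of your sample inputs under PCRE. It is equivalent to a regular grammar with the tail-recursive pattern (?1)*:

(?<=>)((?:\w+)(?:\s*))(?1)*

Just take the first element of the returned array, i.e. array[0].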
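Perl's regex engine also supports the (?1) recursion and the lookbehind used here, so the pattern can be tried directly in Perl. This is a minimal sketch: the helper name `whole_match` is made up, and it reads Perl's `$&` (the whole match), which plays the role of array[0] in the PCRE result.

```perl
use strict;
use warnings;

# Pattern from this answer; (?1) re-enters capture group 1 (a word plus trailing whitespace).
my $re = qr/(?<=>)((?:\w+)(?:\s*))(?1)*/;

# Hypothetical helper: return the whole match, or '' when nothing matches.
sub whole_match {
    my ($html) = @_;
    return $html =~ $re ? $& : '';
}

my @inputs = (
    q(<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>),
    q(<a href="http://www.hp.com"><h1><b>HP</b></h1></a>),
    q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>),
);

print whole_match($_), "\n" for @inputs;
```

Because \w does not match brackets, the third input yields only 1 rather than [1], so inputs with punctuation in the link text would need a broader character class or a parser.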

Answer 1: 5, 2013-08-28 13:04:57
Answer 2: 2, accepted, 2013-08-28 13:01:25
Answer 3: 0, 2013-08-28 13:04:21
Answer 4: 0, 2014-05-26 16:40:10
