Need help forming a regular expression in Perl

Question

I need some suggestions for parsing HTML content. I need to extract the id of each <a> tag inside a div and store it in a variable. I have tried to write a regular expression for this, but it picks up the ids of the <a> tags in every div. I only need to store the ids of the <a> tags that are inside one specific div.

The HTML content is:

<div class="m_categories" id="part_one">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10018">aaa</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10007">bbb</a>
</li>
.
.
.
</div>

<div class="m_categories hidden" id="part_two">
<ul>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10016">ccc</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10011">ddd</a>
</li>
<li>-
<a href="#" class="sel_cat " id="sel_cat_10025">eee</a>
</li>
.
.
</div>

I need some suggestions. Thanks in advance.

Update: the regexes I have used:

if($content=~m/sel_cat " id="([^<]*?)"/is){}

while($content=~m/sel_cat " id="([^<]*?)"/igs){}

Answer 1

You should really look into HTML::Parser rather than trying to use a regex to extract bits of HTML.

One way to use it to extract the id attribute from each div tag would be:

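use strict;
use warnings;
use HTML::Parser;

# This handler only looks at opening tags
sub start_handler {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'div') {        # is it a div element?
        if ($attr->{id}) {          # does the div have an id?
            print "div id found: ", $attr->{id}, "\n";
        }
    }
}

# read_html_somehow() is a placeholder for however you obtain the HTML
my $html = read_html_somehow() or die $!;

my $p = HTML::Parser->new(api_version => 3);
# API version 3 needs an argspec naming the arguments passed to the handler
$p->handler(start => \&start_handler, 'self, tagname, attr, attrseq, text');
$p->parse($html);
$p->eof;    # signal end of input so any buffered text is flushed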

This is a lot more robust and flexible than a regex-based approach.
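The handler above prints the id of every div, whereas the question asks for the ids of the <a> tags inside one particular div. A minimal sketch of that, assuming the target div has id part_two and that a depth counter is enough to cope with nested divs (the sample HTML, $target, $depth and @ids are illustrative names, not part of the original answer):

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample input modeled on the question's HTML (shortened for illustration).
my $html = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
</ul>
</div>
HTML

my $target = 'part_two';   # id of the div we care about (assumption)
my $depth  = 0;            # greater than 0 while inside the target div
my @ids;                   # ids of the <a> tags found inside it

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        if ($tagname eq 'div') {
            if ($depth) { $depth++ }                         # nested div
            elsif (($attr->{id} // '') eq $target) { $depth = 1 }
        }
        elsif ($depth && $tagname eq 'a' && defined $attr->{id}) {
            push @ids, $attr->{id};
        }
    }, 'tagname, attr' ],
    end_h => [ sub {
        my ($tagname) = @_;
        $depth-- if $depth && $tagname eq 'div';             # leaving a div
    }, 'tagname' ],
);
$p->parse($html);
$p->eof;

print "$_\n" for @ids;
```

With the sample input this collects only the two ids from part_two. The counter assumes the div tags are balanced; badly broken markup would need a more forgiving parser.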

Answer 2

There are so many great HTML parsers around. I quite like the Mojo suite, which allows me to use CSS selectors to get a part of the DOM:

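use feature 'say';
use Mojo::DOM;

my $dom = Mojo::DOM->new($html_content);

# Print the text of each matching <a> element. The shorter form
# "say for $dom->find('a.sel_cat')->all_text" relied on older Mojolicious
# releases that forwarded method calls from the collection to its elements.
say $_->all_text for $dom->find('a.sel_cat')->each;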

Output:

aaa
bbb
ccc
ddd
eee

Or for the IDs:

# Calling attr on each element works across Mojolicious versions
say $_->attr('id') for $dom->find('a.sel_cat')->each;

Output:

sel_cat_10018
sel_cat_10007
sel_cat_10016
sel_cat_10011
sel_cat_10025

If you only want the ids in the part_two div, use the selector #part_two a.sel_cat.
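Putting that selector to work, here is a minimal sketch that stores the ids in an array instead of printing them directly. The sample HTML and the @ids variable are assumptions made for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample input modeled on the question's HTML (shortened for illustration).
my $html_content = <<'HTML';
<div class="m_categories" id="part_one">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10018">aaa</a></li>
</ul>
</div>
<div class="m_categories hidden" id="part_two">
<ul>
<li>- <a href="#" class="sel_cat " id="sel_cat_10016">ccc</a></li>
<li>- <a href="#" class="sel_cat " id="sel_cat_10011">ddd</a></li>
</ul>
</div>
HTML

my $dom = Mojo::DOM->new($html_content);

# Restrict the match to <a class="sel_cat"> elements below the div with
# id "part_two", then read the id attribute of each one.
my @ids = map { $_->attr('id') } $dom->find('#part_two a.sel_cat')->each;

print "$_\n" for @ids;
```

Changing the selector to #part_one a.sel_cat would collect the ids from the first div instead.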

Answer 1: score 2, posted 2013-08-30 19:19:41
Answer 2: score 1, accepted, posted 2013-08-30 19:39:26
