
Regular expression matches from the wrong <a> tag

Why is this happening? The regular expression skips the <a tag I want and starts the match at an earlier <a tag instead.

$url = 'urband.net';
$p = '%(.{0,5})<a\s+href=".*?';
$p .= $url;
$p .= '.*?"\s*>(.*?)</a>(.{0,5})%imm';

$s = file_get_contents("http://boringmachines.blogspot.com/2006/12/bitbin-herb-recordings.html");
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);
print_r($matches);

I get this array:

Array
(
    [0] => Array
        (
            [0] => /div><a href="http://photos1.blogger.com/x/blogger/1112/3281/1600/484028/aliasEPlined.jpg"><img style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 162px; CURSOR: hand; HEIGHT: 149px" height="124" alt="" src="http://photos1.blogger.com/x/blogger/1112/3281/320/925013/aliasEPlined.jpg" width="199" border="0" /></a><span style="font-size:85%;">Due to last weeks bad weather here in Glasgow, I was unable to connect to the web and keep up those regular <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462">Herb Recordings </a>mp3's. Instead, I posted a <a href="http://boringmachines.blogspot.com/2006/11/bitbin-herb-recordings.html#links">video</a> of one of their earlier releases, BitBin. Thankfully, some good has came from thsoe storms, as Herb have kindly donated another mp3, in the form of "<em>May</em>" by BitBin.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;"><a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&amp;friendID=26396670">BitBin</a> is a London based artist and had his "Alias" ep released by Herb earlier this year. He influences are both broad, and for and electronic producer, quite unusual. The likes of Brian Eno, Bola and Warp Records, sit side by side with Brian Wilson, Captain Beefheart and dEUS. His bio may explain a few things, as BitBin claims he is all about "<em>glitching his way through any field of music and reality</em>"</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">"<em>May</em>" itself is an expansive and dark slice of electronica reminiscent of Bola and Gescom. For me, however, this is akin to the music Thom Yorke has been pushing Radiohead towards over the last few years. 
The beats echo those of "<em>Idioteque</em>", and believe, me that is no bad thing.</span><br /><span style="font-size:85%;"></span><br /><span style="font-size:85%;">The "Alias" ep can be ordered<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=57230462"> here</a>, however, the cd release will feature 3 extra tracks, "<em>making it, one longer trip</em>". An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => /div>
            [2] => interview and podcast
            [3] =>  with
        )

)

But I expected to get:

Array
(
    [0] => Array
        (
            [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => . An 
            [2] => interview and podcast
            [3] =>  with
        )

)

Welcome to the joys and wonders of using regexes on HTML. Try using DOM instead to find what you're looking for in the HTML.

An XPath query like //a[contains(@href,'urband.net')] would be far more accurate than the regex.
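
A minimal sketch of that DOM approach, assuming PHP's bundled DOM extension is available. The $html string is a made-up two-link sample standing in for the downloaded page, and $links is just an illustrative variable name:

```php
<?php
// Hypothetical sample markup standing in for the page fetched with file_get_contents().
$html = '<div><a href="http://photos1.blogger.com/x/aliasEPlined.jpg">cover</a> text. '
      . 'An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with</div>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so keep parser warnings out of the output.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the anchors whose href attribute contains the wanted host name.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query("//a[contains(@href, 'urband.net')]") as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => $a->textContent,
    ];
}

print_r($links);
```

This only collects the link itself. If the few characters around the link are needed too, the neighbouring text nodes ($a->previousSibling and $a->nextSibling) are probably the place to look, though they can be null or element nodes, so that part would need extra checks.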

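A lazy .*? only makes the match as short as possible from wherever it starts; it does not stop the engine from starting at the first <a on the page and expanding across quotes and tags until it reaches urband.net. Use [^"]* instead, so the href value cannot run past its closing quote. Try: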

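$url = 'urband\.net';                    // escape the dot so it only matches a literal "."
$p = '%(.{0,5})<a\s+href="[^"]*';        // [^"]* cannot leave the current href value
$p.= $url;
$p.= '[^"]*"\s*>(.*?)</a>(.{0,5})%is';   // i: case-insensitive, s: "." also matches newlines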

Edit: tested with Perl:

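$/ = undef;          # slurp mode: read the whole input in one go

my $str = <DATA>;    # the HTML is expected after a __DATA__ marker in the script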
my $count = 0;

while ($str =~ /(.{0,5})<a\s+href="[^"]*urband\.net[^"]*"\s*>(.*?)<\/a>(.{0,5})/sg)
{
   print "Array\n";
   print "(\n";
   print "    [$count] => Array\n";
   print "        (\n";
   print "            [0] => $&\n";
   print "            [1] => $1\n";
   print "            [2] => $2\n";
   print "            [3] => $3\n";
   print "        )\n";
   print "\n";
   print ")\n";
   ++$count;
}

Output:

Array
(
    [0] => Array
        (
            [0] => . An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with
            [1] => . An
            [2] => interview and podcast
            [3] =>  with
        )

)
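
To see the difference in PHP, here is a small sketch that runs both href patterns against a made-up two-link string ($s, $lazy and $strict are illustrative names, and the surrounding-context groups are left out to keep it short). Only the second link points at urband.net:

```php
<?php
// Hypothetical sample: two links, only the second href contains urband.net.
$s = '<p><a href="http://boringmachines.blogspot.com/2006/11/bitbin-herb-recordings.html#links">video</a> text. '
   . 'An <a href="http://www.urband.net/interview/bitbin/index.html">interview and podcast</a> with</p>';

// Lazy dot: may cross quotes and tags on its way to "urband.net".
$lazy   = '%<a\s+href=".*?urband\.net.*?"\s*>(.*?)</a>%is';
// Negated class: has to stay inside a single href value.
$strict = '%<a\s+href="[^"]*urband\.net[^"]*"\s*>(.*?)</a>%is';

preg_match($lazy, $s, $m1);
preg_match($strict, $s, $m2);

echo $m1[0], "\n\n";  // starts at the first <a and runs through the second link
echo $m2[0], "\n";    // just the urband.net link
```

Both patterns capture the same link text in group 1; what changes is where the overall match begins, which is why the leading (.{0,5}) group in the question picked up /div> instead of ". An ".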


 