php preg_match_all backreference

For the following input string and pattern:

$str1 = '<span class="outline">Iron Man butts heads with Nick Fury and Shield after HYDRA attacks a meeting of the United Nations.</span>
<span class="credit">
    Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>';

$pattern='/class="credit">[\s]+?Dir:([,\s]+?<a[\s]+?href="\/name\/nm\d{7}\/">([\/\(\)-:@!%*#=_|?$&;.\w\s]+?)<\/a>)+/um';

preg_match_all($pattern,$str1,$dir);

Output is as follows for print_r:

Array
(
    [0] => Array
        (
            [0] => class="credit"> Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>
        )

    [1] => Array
        (
            [0] => , <a href="/name/nm1367649/">Aditya Parikh</a>
        )

    [2] => Array
        (
            [0] => Aditya Parikh
        )

)

As you can see, $dir[2] contains only Aditya Parikh. I was hoping to receive Vinton Heuck and Ciro Nieli as well, but did not.

Is there a solution?

The logic behind the matching array returned by preg_match_all is not that obvious.

First of all, don't use regex to parse HTML. With that said:

The result you're getting is in the form $array[paren_num][match_num].

A basic example: the string abc run against the regex /(.)/ would return the following matches array:

Array
(
    [0] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )

    [1] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )

)

Index 0 contains the full text consumed by each match. Index 1 holds the first capturing group (there is only one pair of parentheses). The indexes 0-2 inside each of those correspond to the individual matches. In other words, the expression was applied three times before the input was exhausted.
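
To reproduce the example above, here is a minimal sketch (the variable names $matches and $sets are my own). It also shows the PREG_SET_ORDER flag, which flips the layout to $array[match_num][paren_num] and may be easier to read in some cases.

```php
<?php
// Default layout (PREG_PATTERN_ORDER): $matches[group][match].
// $matches[0] holds the full matches, $matches[1] the first capturing group.
preg_match_all('/(.)/', 'abc', $matches);

// PREG_SET_ORDER groups the results per match instead: $sets[match][group].
preg_match_all('/(.)/', 'abc', $sets, PREG_SET_ORDER);

print_r($matches);
print_r($sets);
```

With PREG_SET_ORDER, each entry of $sets describes one match: element 0 is the full match and element 1 is the first group.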

I hope this helps.

You should really consider using a DOM parser, for example PHP Simple HTML DOM Parser (simple_html_dom). Regular expressions just cannot properly parse HTML.

However, here is why your approach does not work as expected:

You are using the same capturing group for all three names. A capturing group has only one number, and when it is repeated it keeps only its last iteration, so you always get the last thing that was captured (the right-most name). But even if you matched just one name at a time (arbitrarily far into the span tag), you would run into a different problem:

Matches cannot overlap. Since all three matches you want would contain at least class="credit"> Dir: and some more common text, you cannot get all of them. You could solve this with a lookbehind assertion (because it is not part of the match), but unfortunately PHP does not allow variable-length lookbehinds (which would be required here). There are workarounds, but at the end of the day you are best off using a DOM parser.
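
For illustration, here is a sketch of both points. It is my own example rather than code from the question, and the names $html, $repeated, $pattern and $names are hypothetical. The first call shows that a repeated group keeps only its last iteration. The second shows one possible workaround: the \G anchor lets each further match start exactly where the previous one ended, so every name becomes a separate match and no lookbehind is needed.

```php
<?php
$html = '<span class="credit">
    Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>';

// A repeated capturing group is overwritten on every iteration,
// so only the right-most name survives in $repeated[1].
preg_match('/(?:<a[^>]*>([^<]+)<\/a>[,\s]*)+/', $html, $repeated);

// Workaround: the first match must begin at class="credit"> ... Dir:
// and every later match must begin where the previous one ended (\G).
// (?!\A) stops \G from matching at the very start of the string.
$pattern = '/(?:class="credit">\s+Dir:|\G(?!\A))[,\s]*<a\s+href="\/name\/nm\d{7}\/">([^<]+)<\/a>/';
preg_match_all($pattern, $html, $m);
$names = $m[1];

print_r($repeated[1]);
print_r($names);
```

This still inherits the fragility of parsing HTML with a regex (attribute order, extra attributes, nested tags), so treat it as a stopgap.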

Here is a basic example using simple_html_dom:

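require "simple_html_dom.php";

// Parse the HTML string into a DOM object
$html = str_get_html($str1);

// Collect the text of every link inside <span class="credit">
$names = array();
foreach ($html->find('span[class=credit] a') as $link) {
    $names[] = $link->innertext;
}

print_r($names);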

Resulting in:

Array
(
    [0] => Vinton Heuck
    [1] => Ciro Nieli
    [2] => Aditya Parikh
)
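
If adding a third-party library is not an option, the same extraction can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is an alternative to the parser used above, not part of the original answer. It assumes the dom extension is enabled, and $html, $doc and $xpath are illustrative names.

```php
<?php
$html = '<span class="outline">Iron Man butts heads with Nick Fury.</span>
<span class="credit">
    Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>';

// Collect libxml warnings about the incomplete fragment instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> that is a direct child of <span class="credit">.
$xpath = new DOMXPath($doc);
$names = [];
foreach ($xpath->query('//span[@class="credit"]/a') as $link) {
    $names[] = trim($link->textContent);
}

print_r($names);
```

The XPath expression plays the role of the span[class=credit] a selector in the example above.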
