For the following input string and pattern:
$str1 = 'span class="outline">Iron Man butts heads with Nick Fury and Shield after HYDRA attacks a meeting of the United Nations.</span>
<span class="credit">
Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>';
$pattern='/class="credit">[\s]+?Dir:([,\s]+?<a[\s]+?href="\/name\/nm\d{7}\/">([\/\(\)-:@!%*#=_|?$&;.\w\s]+?)<\/a>)+/um';
preg_match_all($pattern,$str1,$dir);
Output is as follows for print_r:
Array
(
[0] => Array
(
[0] => class="credit"> Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>
)
[1] => Array
(
[0] => , <a href="/name/nm1367649/">Aditya Parikh</a>
)
[2] => Array
(
[0] => Aditya Parikh
)
)
As you can see, Array[2] contains only Aditya Parikh. I was hoping to receive Vinton Heuck and Ciro Nieli as well, but didn't.
Is there a solution?
The logic behind the matching array returned by preg_match_all is not that obvious.
First of all, don't use regex to parse html. With that said:
The result you're getting is in the form $array[paren_num][match_num].
A basic example: the string abc run against the regex /(.)/ would return the following matches array:
Array
(
[0] => Array
(
[0] => a
[1] => b
[2] => c
)
[1] => Array
(
[0] => a
[1] => b
[2] => c
)
)
Index 0 contains the full text consumed by each match. Index 1 is the first capturing group (there is only one pair of parentheses). The inner indexes 0-2 correspond to the individual matches. In other words, the expression was applied three times before it reached the end of the string.
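The layout described above can be checked with a short sketch. The variable names $m and $sets are only illustrative. The second call uses the PREG_SET_ORDER flag, which groups the result by match instead of by parenthesis and may be easier to read in some cases.

```php
<?php
// Default flag (PREG_PATTERN_ORDER): $m[paren_num][match_num].
preg_match_all('/(.)/', 'abc', $m);
// $m[0] holds the full matches, $m[1] holds capturing group 1.

// PREG_SET_ORDER flips the layout: $sets[match_num][paren_num].
preg_match_all('/(.)/', 'abc', $sets, PREG_SET_ORDER);

print_r($m[1]);    // group 1 across all three matches
print_r($sets[0]); // full match and group 1 of the first match
```

With PREG_SET_ORDER, each element of $sets describes one complete match, so looping over matches does not require juggling two indexes.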
I hope this helps.
You should really consider using a DOM parser, for example PHP Simple HTML DOM Parser. Regular expressions just cannot properly parse HTML.
However, here is why your approach does not work as expected:
You are using the same capturing group for all 3 names. But a capturing group has only one number, so you will always get only the last thing it captured (the right-most name). But even if you matched just one name (arbitrarily far into the span tag), you would run into a different problem:
Matches cannot overlap. Since all three matches you want would contain at least class="credit"> Dir: and some more common text, you cannot get all of them. You could solve this with a lookbehind assertion (because it is not part of the match), but unfortunately PHP does not allow variable-length lookbehinds (which would be required). There are workarounds to solve this, but at the end of the day, you are best off using a DOM parser.
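To illustrate both points, here is a minimal sketch of one possible workaround, not necessarily the one hinted at above. It first shows that a repeated capturing group keeps only its last capture, then splits the job into two steps: isolate the run of links after Dir:, then match each link separately. The sample string is a shortened stand-in for the question's input, and the simplified patterns (for example [^<]+ for the name) are assumptions.

```php
<?php
// A repeated group keeps only its last capture.
preg_match('/(?:(\w),?)+/', 'a,b,c', $last);
// $last[1] is 'c'; the earlier captures 'a' and 'b' are overwritten.

// Sample input shaped like the question's string (shortened).
$html = '<span class="credit">
Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, <a href="/name/nm1367649/">Ciro Nieli</a>, <a href="/name/nm1367649/">Aditya Parikh</a>';

$names = [];
// Step 1: isolate the run of director links that follows "Dir:".
$blockPattern = '/class="credit">\s*Dir:((?:[,\s]*<a\s+href="\/name\/nm\d{7}\/">[^<]+<\/a>)+)/u';
if (preg_match($blockPattern, $html, $block)) {
    // Step 2: every link is now its own match, so group 1 keeps each name.
    preg_match_all('/<a\s[^>]*>([^<]+)<\/a>/u', $block[1], $links);
    $names = $links[1];
}
print_r($names);
```

This still inherits the fragility of regex-based HTML handling, so it is only reasonable when the markup is known to be this regular.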
Here is just a basic example using the parser I linked above:
require "simple_html_dom.php";
// Parse the string into a DOM tree
$html = str_get_html($str1);
$names = array();
// Collect the text of every link inside the credit span
foreach ($html->find('span[class=credit] a') as $link) {
    $names[] = $link->innertext;
}
print_r($names);
Resulting in:
Array
(
[0] => Vinton Heuck
[1] => Ciro Nieli
[2] => Aditya Parikh
)
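If adding a third-party library is not an option, roughly the same thing can be sketched with PHP's bundled DOM extension. This is an alternative to the parser used above, not part of the original answer, and it assumes the dom extension is enabled. The sample string is a shortened, well-formed stand-in for the question's input.

```php
<?php
// Sample input shaped like the question's string (shortened, well-formed).
$html = '<span class="credit">Dir: <a href="/name/nm0381817/">Vinton Heuck</a>, '
      . '<a href="/name/nm1367649/">Ciro Nieli</a>, '
      . '<a href="/name/nm1367649/">Aditya Parikh</a></span>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
// Select every <a> nested inside a <span class="credit">.
foreach ($xpath->query('//span[@class="credit"]//a') as $link) {
    $names[] = trim($link->textContent);
}
print_r($names);
```

The XPath expression plays the same role as the span[class=credit] a selector in the example above.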