PHP regex - empty matches
I am trying to extract data from a string (the entire source of a website, fetched via cURL):
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
I want to collect every 3-character anchor text in an array, e.g. AAL and AAT (there are many more).
What I have is:
$subject = curl_exec($ch);
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
print_r($matches);
As a result I get:
Array ( [0] => Array ( ) )
Can you suggest a solution?
You can use a DOMDocument object to build the array, like this:
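// $str holds the HTML source, e.g. the string returned by curl_exec().
$doc = new DOMDocument();
$doc->loadHTML($str);
$matches = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $text = $a->nodeValue;
    // Keep only link texts that are exactly three characters long.
    if (strlen($text) === 3) $matches[] = $text;
}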
This iterates over all anchor elements in the HTML string and builds the following array:
Array
(
[0] => AAL
[1] => AAT
)
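A slightly more defensive variant of the same idea, offered as a sketch rather than a definitive implementation: it assumes markup like the sample in the question, collects libxml's warnings about imperfect real-world HTML instead of printing them, and swaps the length check for a regex so that only three-character uppercase alphanumeric link texts are kept. The helper name `extractTickers` and the wrapping `<table>` element are assumptions made for illustration.

```php
<?php
// Hypothetical helper: collect link texts that look like 3-character tickers.
function extractTickers(string $html): array
{
    // Real pages are rarely valid HTML; buffer parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tickers = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $text = trim($a->nodeValue);
        // Same character class as the pattern in the question.
        if (preg_match('/^[0-9A-Z]{3}$/', $text) === 1) {
            $tickers[] = $text;
        }
    }
    return $tickers;
}

// Sample rows based on the question, wrapped in <table> for the parser's benefit.
$str = '<table>'
     . '<tr><td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>'
     . '<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td></tr>'
     . '<tr class="even"><td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>'
     . '<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA AKCYJNA</a></td></tr>'
     . '</table>';

$tickers = extractTickers($str);
print_r($tickers);
```

Unlike a plain `strlen()` check, the regex also rejects three-character link texts that are not tickers, such as lowercase words.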
I just tried your example, and your regex works as expected on the small sample you provided:
$subject = <<<EOT
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($matches);
echo '</pre>';
Result:
Array
(
[0] => Array
(
[0] => AAL
[1] => AAT
)
)
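One detail worth noting: the pattern has no capture group, so `$matches[0]` holds each full `<td>…</td>` match, and a browser presumably renders those as links, leaving only the ticker text visible. A minimal sketch of one way to get the bare tickers, assuming the same sample markup, is to wrap the ticker in parentheses and read `$matches[1]`:

```php
<?php
// Sample rows copied from the question.
$subject = <<<EOT
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

// Same pattern as in the question, with parentheses added around the ticker.
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">([0-9A-Z]{3})</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full matches, $matches[1] the captured tickers.
$tickers = $matches[1];
print_r($tickers);
```

With `PREG_PATTERN_ORDER`, each capture group gets its own sub-array, so the tickers can be used directly without stripping tags afterwards.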
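That said, I dug up what I believe is the source URL of your curl request, and the pattern failed when I tested it against that page. So I adjusted the regex to: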
/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is
Now things seem to work well with my code, which attempts to recreate the curl request you are making.
// Set the URL.
$url="http://www.gpw.pl/lista_spolek_en";
// The actual curl request.
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$subject = curl_exec($ch);
curl_close($ch);
// Set the regex pattern.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';
// Run the preg match all command with the regex pattern.
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
// Return the results.
echo '<pre>';
print_r($matches);
echo '</pre>';
On my end, the output looks good:
Array
(
[0] => Array
(
[0] => AAL
[1] => AAT
[2] => ABC
[3] => ABE
[4] => ABM
[5] => ABS
[6] => ACE
[7] => ACG
[8] => ACP
[9] => ACS
[10] => ACT
[11] => ADS
[12] => AGO
[13] => AGT
[14] => ALC
[15] => ALM
[16] => ALR
[17] => ALT
[18] => AMB
[19] => AMC
[20] => APL
[21] => APN
[22] => APT
[23] => ARC
[24] => ARR
[25] => ASB
[26] => ASE
[27] => ASG
[28] => AST
[29] => ATC
[30] => ATD
[31] => ATG
[32] => ATL
[33] => ATM
[34] => ATP
[35] => ATR
[36] => ATS
[37] => AWB
[38] => AWG
[39] => EAT
[40] => ACP
[41] => ALR
[42] => BZW
[43] => EUR
[44] => JSW
[45] => KER
[46] => KGH
[47] => LPP
[48] => LTS
[49] => LWB
[50] => MBK
[51] => OPL
[52] => PEO
[53] => PGE
[54] => PGN
[55] => PKN
[56] => PKO
[57] => PZU
[58] => SNS
[59] => TPE
)
)
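The list above contains repeats (for example, `ACP` and `ALR` each appear twice), presumably because the page lists some companies in more than one table. A sketch of deduplicating the matches follows; it runs against a small hypothetical snippet rather than the live page so that it works offline, and the shortened `href` values are made up for illustration.

```php
<?php
// Hypothetical snippet standing in for the live page; ACP is repeated on purpose.
$subject = '<td><a href="/karta_spolki/A/">ACP</a></td>'
         . '<td><a href="/karta_spolki/B/">ALR</a></td>'
         . '<td><a href="/karta_spolki/A/">ACP</a></td>';

// The adjusted lookaround pattern from this answer.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// array_unique() keeps the first occurrence; array_values() reindexes from 0.
$unique = array_values(array_unique($matches[0]));
print_r($unique);
```

Because of the `i` modifier, `[0-9A-Z]` also matches lowercase letters; dropping that modifier would restrict the match to uppercase tickers.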