
PHP regex - empty matches

I'm trying to extract data from a string (the full source of a web page, fetched with cURL):

<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>

I'd like to collect all the three-character anchor texts in an array, e.g. AAL and AAT (there are many more).

What I have is:

$subject = curl_exec($ch);        
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
print_r($matches);

As a result I get:

Array ( [0] => Array ( ) ) 

Can you suggest a solution?

You can use a DOMDocument object to build the array, like this:

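// $str holds the HTML source, e.g. the string returned by curl_exec().
$doc = new DOMDocument();
$doc->loadHTML($str);

$matches = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Keep only anchors whose text is exactly three bytes long.
    $text = $a->nodeValue;
    if (strlen($text) === 3) {
        $matches[] = $text;
    }
}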

This iterates over all anchor elements in the HTML string and builds the following array:

Array
(
    [0] => AAL
    [1] => AAT
)
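
The DOM approach can be narrowed so that unrelated three-character links elsewhere on the page are not picked up. The following is a minimal sketch, not part of the original answer: it assumes the company links always sit directly inside a `<td>` and contain `/karta_spolki/` in their `href`, as in the sample from the question. The variable names `$html` and `$codes` are illustrative.

```php
<?php
// Sample markup copied from the question; a real page would come from curl_exec().
$html = <<<'HTML'
<table>
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
</tr>
</table>
HTML;

// Collect parser warnings instead of printing them; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$codes = array();

// Assumption: company links sit directly in a table cell and point to /karta_spolki/.
foreach ($xpath->query('//td/a[contains(@href, "/karta_spolki/")]') as $a) {
    $text = trim($a->textContent);
    // Keep exactly three uppercase letters or digits, as in the question's pattern.
    if (preg_match('/^[0-9A-Z]{3}$/', $text)) {
        $codes[] = $text;
    }
}

print_r($codes);
```

Checking the text with `[0-9A-Z]{3}` instead of `strlen()` also rejects three-character strings that contain lowercase letters or punctuation.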

I just tried your example, and your regex works as expected on the small sample provided:

$subject = <<<EOT
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

echo '<pre>';
print_r($matches);
echo '</pre>';

Result:

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
        )

)
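
One detail worth noting: the pattern has no capture group, so `$matches[0]` contains each whole `<td>...</td>` match. Printed into a browser page, that markup renders as a link, which is presumably why only the link text is visible in the output above. The sketch below is a variant that is not part of the original answer: it adds parentheses around the three-character part so the bare text is available in `$matches[1]`. The variable `$codes` is illustrative.

```php
<?php
$subject = <<<'EOT'
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

// Same pattern as in the question, with a capture group around the three-character part.
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">([0-9A-Z]{3})</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full <td>...</td> strings, $matches[1] only the captured text.
$codes = $matches[1];
print_r($codes);
```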

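That said, I dug up what I believe is the source URL of your cURL request, and your pattern failed when I tested it against that page. So I adjusted the regex to: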

/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is

Now it seems to work well with my code, which tries to recreate the cURL request you are making.

// Set the URL.
$url="http://www.gpw.pl/lista_spolek_en";

// The actual curl request.
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$subject = curl_exec($ch);
curl_close($ch);

// Set the regex pattern.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';

// Run the preg match all command with the regex pattern.
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// Return the results.
echo '<pre>';
print_r($matches);
echo '</pre>';

On my end, the output looks good:

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
            [2] => ABC
            [3] => ABE
            [4] => ABM
            [5] => ABS
            [6] => ACE
            [7] => ACG
            [8] => ACP
            [9] => ACS
            [10] => ACT
            [11] => ADS
            [12] => AGO
            [13] => AGT
            [14] => ALC
            [15] => ALM
            [16] => ALR
            [17] => ALT
            [18] => AMB
            [19] => AMC
            [20] => APL
            [21] => APN
            [22] => APT
            [23] => ARC
            [24] => ARR
            [25] => ASB
            [26] => ASE
            [27] => ASG
            [28] => AST
            [29] => ATC
            [30] => ATD
            [31] => ATG
            [32] => ATL
            [33] => ATM
            [34] => ATP
            [35] => ATR
            [36] => ATS
            [37] => AWB
            [38] => AWG
            [39] => EAT
            [40] => ACP
            [41] => ALR
            [42] => BZW
            [43] => EUR
            [44] => JSW
            [45] => KER
            [46] => KGH
            [47] => LPP
            [48] => LTS
            [49] => LWB
            [50] => MBK
            [51] => OPL
            [52] => PEO
            [53] => PGE
            [54] => PGN
            [55] => PKN
            [56] => PKO
            [57] => PZU
            [58] => SNS
            [59] => TPE
        )

)
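
When the result is still an empty array, it helps to tell apart "the page was never fetched" from "the regex found nothing". The following is a hedged sketch, not part of the original answer. `extract_codes` is a hypothetical helper name; it reuses the lookaround pattern from this answer and relies on two documented behaviours: `curl_exec()` returns `false` on failure (and `true` rather than the page body when `CURLOPT_RETURNTRANSFER` is not set), and `preg_match_all()` returns `false` on a PCRE error.

```php
<?php
/**
 * Hypothetical helper: pulls the three-character link texts out of a
 * fetched page and fails loudly when the input is not usable.
 */
function extract_codes($subject)
{
    // curl_exec() gives false on failure, or true when the body was not returned.
    if (!is_string($subject) || $subject === '') {
        throw new InvalidArgumentException('No page source to search.');
    }

    // Lookbehind requires a preceding ">", lookahead requires a following "</a></td>".
    $pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';
    $count = preg_match_all($pattern, $subject, $matches);
    if ($count === false) {
        // preg_last_error() returns the PCRE error code for the failed call.
        throw new RuntimeException('Regex error code: ' . preg_last_error());
    }

    // No capture groups, so the full matches are the three-character texts.
    return $matches[0];
}

// Usage with one row from the question's sample instead of a live request.
$sample = '<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>';
print_r(extract_codes($sample));
```

Passing the return value of `curl_exec($ch)` straight into the helper turns a failed or misconfigured transfer into an exception instead of a silent empty result.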
