
How can I parse this HTML with a regular expression?

I am trying to write a regular expression to extract the href and anchor text of each link in a list of URLs from an HTML source. The anchor text can be any value.

The HTML part is as follows:

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>

I tried the following regular expression, but it does not work: it grabs everything up to the last </a> tag and fails.

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/', $source , $website_array);

What would be a working regular expression to extract my required data?

If you must know, the expression is greedy, so it will likely match from the start of the first anchor to the end of the last; the /U modifier will fix that:

preg_match('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/U', $source , $website_array);

Note that pcre.backtrack_limit applies to ungreedy mode.
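To see the difference concretely, here is a minimal sketch that runs the same pattern with and without /U. The two-anchor $source string is a shortened stand-in for the question's HTML, and the variable names are assumptions made for illustration. It uses preg_match_all with PREG_SET_ORDER so that each match comes back as one row.

```php
<?php
// Shortened sample input with two anchors, modelled on the HTML in the question.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

$base = '/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/';

// Greedy: the first (.*) runs on to the last href, so both anchors
// collapse into a single match.
$greedyCount = preg_match_all($base, $source, $greedy, PREG_SET_ORDER);

// Ungreedy (/U): every quantifier matches as little as possible,
// so each anchor is matched on its own.
$lazyCount = preg_match_all($base . 'U', $source, $lazy, PREG_SET_ORDER);

// preg_match_all returns false on failure; preg_last_error() tells you
// whether pcre.backtrack_limit was the cause.
if ($lazyCount === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit reached\n";
}

echo "greedy matches: $greedyCount, ungreedy matches: $lazyCount\n";
foreach ($lazy as $row) {
    echo $row[1], ' => ', $row[2], "\n";
}
```

With the greedy pattern the first capture group contains the whole stretch between the first and the last href, which is why the original attempt looked broken.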

Using negated character classes might give better performance:

preg_match('/<a rel="nofollow" target="_blank" href="([^"]*)" class="get-all">([^<]*)<\/a>/', $source , $website_array);

This will have trouble with tags inside the anchor itself.
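As a rough illustration of that limitation, the sketch below applies the negated-class pattern to two inputs: a plain anchor, and one whose text contains a <b> element. Both input strings are made up for this example and are not part of the question's HTML.

```php
<?php
$pattern = '/<a rel="nofollow" target="_blank" href="([^"]*)" class="get-all">([^<]*)<\/a>/';

// Hypothetical inputs: one plain anchor, one with a nested <b> element.
$plain  = '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>';
$nested = '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is <b>Url-2</b></a>';

$plainHits  = preg_match_all($pattern, $plain, $plainMatches, PREG_SET_ORDER);
$nestedHits = preg_match_all($pattern, $nested, $nestedMatches, PREG_SET_ORDER);

// The plain anchor matches. The nested one does not, because [^<]* stops
// at the "<" of <b> and the pattern then requires a literal </a>.
echo "plain: $plainHits match(es), nested: $nestedHits match(es)\n";
```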

With the aforementioned limitations, I would seriously consider using an HTML parser:

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
foreach ($xp->query('//a[@class="get-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href');
    $text = $anchor->nodeValue;
}


This would happily handle the attributes in a different order and give you the ability to query further inside, etc.
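A small sketch of that claim, assuming PHP's DOM extension is available: the first anchor below has its attributes reordered and the second contains a nested <b> element, two cases the fixed-order regex cannot handle. The input string and the $links array are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical input: reordered attributes on the first anchor,
// a nested <b> element inside the second.
$source = '<div class="links">'
    . '<a class="get-all" href="http://url1.com" target="_blank" rel="nofollow">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is <b>Url-2</b></a>'
    . '</div>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

$links = [];
foreach ($xpath->query('//a[@class="get-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        // textContent concatenates the text of all descendant nodes.
        'text' => $anchor->textContent,
    ];
}

print_r($links);
```

Note that @class="get-all" compares the whole attribute value, so an anchor with class="get-all extra" would not be selected by this query.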

Try:

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source , $website_array);

It will match all links and return an array with the info. Notes:

[^"] - matches any character except "
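For reference, here is a hedged sketch of how the returned array might be consumed. With the default flag (PREG_PATTERN_ORDER), index 0 holds the full matches, index 1 the first capture group (the hrefs) and index 2 the second (the anchor texts). The shortened $source and the $links array are assumptions for illustration.

```php
<?php
// Shortened sample input with two anchors, modelled on the HTML in the question.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source, $website_array);

// Pair each href (group 1) with the anchor text at the same index (group 2).
$links = [];
foreach ($website_array[1] as $i => $href) {
    $links[] = ['href' => $href, 'text' => $website_array[2][$i]];
}

print_r($links);
```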

While parsing HTML with regex is generally a bad idea (I would suggest looking at the DOMDocument class for a better solution), it can be used in cases where you have a VERY specific idea of what you are trying to extract and can be assured that, in all cases, the variable text won't actually break your regex.

For your case, you might try:

$pattern = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
preg_match_all($pattern, $source, $website_array);

Note the ungreedy modifier (U) at the end. It is very important: it makes each group match as little as possible.
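If you would rather not flip the greediness of the whole pattern, the individual quantifiers can be made lazy with a trailing ? instead. The sketch below, on a shortened two-anchor input assumed for illustration, checks that both spellings return the same result here.

```php
<?php
// Shortened sample input with two anchors, modelled on the HTML in the question.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

// Pattern-wide ungreedy modifier, as in the answer above.
$withModifier = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
// Same pattern with per-quantifier laziness (.*?) and no modifier.
$withLazy = '#<a rel="nofollow" target="_blank" href="(.*?)" class="get-all">(.*?)</a>#';

preg_match_all($withModifier, $source, $byModifier);
preg_match_all($withLazy, $source, $byLazy);

// Both variants capture the same hrefs and anchor texts for this input.
var_dump($byModifier === $byLazy);
print_r($byLazy[1]);
```

The per-quantifier form keeps the default greedy behaviour for any other quantifier you add to the pattern later, which some people find easier to reason about.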

Alternatively, you can do it like this:

<?php
$html = <<<HTML
<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>
HTML;


$xml = new DOMDocument();
@$xml->loadHTML($html);

$links=array();
$i=0;
//Get all divs
foreach($xml->getElementsByTagName('div') as $divs) {
    //if this div has a class="links"
    if($divs->getAttribute('class')=='links'){
        //loop through the anchors inside this div only
        foreach($divs->getElementsByTagName('a') as $a){
            //if this a tag does not have a class="get-all" continue to next
            if($a->getAttribute('class')!='get-all')
                continue;

            //Assign values to the links array
            $links[$i]['href']=$a->getAttribute('href');
            $links[$i]['value']=$a->nodeValue;
            $i++;
        }

    }
}

print_r($links);
/*
Array
(
    [0] => Array
        (
            [href] => http://url1.com
            [value] => URL1
        )

    [1] => Array
        (
            [href] => http://url2.com
            [value] => This is Url-2
        )

    [2] => Array
        (
            [href] => http://url3.com
            [value] => This is Url-3
        )

    [3] => Array
        (
            [href] => http://url4.com
            [value] => Sweet URL 4
        )

)
*/
?>
