简体   繁体   中英

How can I parse this HTML with a regular expression?

I am trying to write a regular expression to extract the href and anchor text of a list of URLs from an HTML source. The anchor text can be any values.

The HTML part goes as follow:

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>

I tried the following regular expression, but it's not working since it grabs everything before the </a> tag and fails.

preg_match_('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/', $source , $website_array);

What would be a working regular expression to extract my required data?

If you must know, the expression is greedy, so it will likely match the start of the first anchor and the end of the last; the /U modifier will fix that:

preg_match('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/U', $source , $website_array);

Note that pcre.backtrack_limit applies to ungreedy mode.

Using look-ahead sets might give better performance:

preg_match('/<a rel="nofollow" target="_blank" href="([^"]*)" class="see-all">([^<]*)<\/a>/', $source , $website_array);

This will have trouble with tags inside the anchor itself.

With aforementioned limitations, I would seriously consider using a HTML parser:

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
foreach ($xp->query('//a[@class="see-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href');
    $text = $anchor->nodeValue;
}

Demo

This would happily handle the attributes in a different order and give you the ability to query further inside, etc.

Try

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source , $website_array);

it will match all links and return an array with info. Notes:

[^"] - matches any character except "

While parsing HTML with regex is generally a bad idea (I would suggest looking at DOMDocument class for better solution), it can be used in some cases where you have a VERY specific idea of what you are trying to extract and can be assured that in all cases, that variable text won't actually break your regex.

For your case, you might try:

$pattern = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
preg_match_all($pattern, $source, $website_array);

Note the ungreedy modifier ( U ) at the end. That is very important to only match the smallest match possible.

Alternatively you can do it like this:

<?php
$html = <<<HTML
<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>
HTML;


$xml = new DOMDocument();
@$xml->loadHTML($html);

$links=array();
$i=0;
//Get all divs
foreach($xml->getElementsByTagName('div') as $divs) {
    //if this div has a class="links"
    if($divs->getAttribute('class')=='links'){
        //loop through this div
        foreach($xml->getElementsByTagName('a') as $a){
            //if this a tag dose not have a class="get-all" continue to next
            if($a->getAttribute('class')!='get-all')
            continue;

            //Assign values to the links array
            $links[$i]['href']=$a->getAttribute('href');
            $links[$i]['value']=$a->nodeValue;
            $i++;
        }

    }
}

print_r($links);
/*
Array
(
    [0] => Array
        (
            [href] => http://url1.com
            [value] => URL1
        )

    [1] => Array
        (
            [href] => http://url2.com
            [value] => This is Url-2
        )

    [2] => Array
        (
            [href] => http://url3.com
            [value] => This is Url-3
        )

    [3] => Array
        (
            [href] => http://url4.com
            [value] => Sweet URL 4
        )

)
*/
?>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM