How to parse this HTML using regular expressions?

Question

I am trying to write a regular expression to extract the href and anchor text of a list of URLs from an HTML source. The anchor text can be any value.

The HTML looks like this:

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>

I tried the following regular expression, but it doesn't work: it grabs everything before the </a> tag and fails.

preg_match_('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/', $source , $website_array);

What would be a working regular expression to extract my required data?

Answer 1

If you must know, the expression is greedy, so it will likely match from the start of the first anchor to the end of the last; the /U modifier will fix that:

preg_match('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/U', $source , $website_array);

Note that pcre.backtrack_limit still applies in ungreedy mode.
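
To illustrate the difference, here is a minimal sketch. The shortened two-anchor $source string and the variable names $greedy and $lazy are assumptions made for the example, not part of the question's code:

```php
<?php
// Assumed sample input: two anchors on one line, shortened from the question's HTML.
$source = '<a href="http://url1.com">URL1</a><a href="http://url2.com">This is Url-2</a>';

// Greedy: the first (.*) runs as far as it can, so it swallows the first
// anchor's text and the opening of the second anchor.
preg_match('/<a href="(.*)">(.*)<\/a>/', $source, $greedy);

// Ungreedy (/U): every quantifier stops at the first position that still
// lets the overall match succeed.
preg_match('/<a href="(.*)">(.*)<\/a>/U', $source, $lazy);

echo $greedy[1], "\n"; // http://url1.com">URL1</a><a href="http://url2.com
echo $lazy[1], "\n";   // http://url1.com
echo $lazy[2], "\n";   // URL1
```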

Using negated character classes might give better performance:

preg_match('/<a rel="nofollow" target="_blank" href="([^"]*)" class="get-all">([^<]*)<\/a>/', $source , $website_array);

This will have trouble with tags inside the anchor itself.
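
A small sketch of that limitation, using a hypothetical input in which the second anchor wraps part of its text in <b> (the input string and the simplified pattern are assumptions for illustration):

```php
<?php
// Assumed sample input: the second anchor contains a nested <b> tag.
$source = '<a href="http://url1.com">URL1</a>'
    . '<a href="http://url2.com"><b>Bold</b> link</a>';

// [^<]* cannot cross the "<" of the nested tag, so the second anchor
// never reaches its closing </a> and is skipped.
preg_match_all('/<a href="([^"]*)">([^<]*)<\/a>/', $source, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
// Only the first anchor is reported.
```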

Given the aforementioned limitations, I would seriously consider using an HTML parser:

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
foreach ($xp->query('//a[@class="get-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href');
    $text = $anchor->nodeValue;
}

This would happily handle the attributes in a different order and give you the ability to query further inside, etc.
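
As a rough check of that claim, the sketch below feeds the parser a hypothetical variant of the markup with reordered attributes and a nested tag. The input string and the $links array are assumptions for the example:

```php
<?php
// Assumed sample input: attributes in a different order on the first anchor,
// and a nested <b> tag inside the second one.
$source = '<div class="links">'
    . '<a class="get-all" href="http://url1.com" rel="nofollow" target="_blank">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all"><b>Bold</b> link</a>'
    . '</div>';

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);

$links = [];
foreach ($xp->query('//a[@class="get-all"]') as $anchor) {
    // nodeValue returns the text of the anchor, including text in child elements.
    $links[] = [$anchor->getAttribute('href'), $anchor->nodeValue];
}

print_r($links);
```

Both anchors are found regardless of attribute order, and the nested markup is flattened to plain text.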

Answer 2

Try:

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source , $website_array);

It will match all links and return an array with the info. Notes:

[^"] - matches any character except "
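
For reference, a short sketch of what the returned array looks like with the default PREG_PATTERN_ORDER flag. The shortened $source and the $hrefs, $texts, and $links names are assumptions for the example:

```php
<?php
// Assumed sample input: the first two anchors from the question's HTML.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

$count = preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source, $website_array);

// Default PREG_PATTERN_ORDER layout:
//   $website_array[0] = full matches, [1] = hrefs, [2] = anchor texts.
$hrefs = $website_array[1];
$texts = $website_array[2];

// Pair each href with its anchor text.
$links = array_combine($hrefs, $texts);

print_r($links);
```

Passing PREG_SET_ORDER as a fourth argument would instead group the results per match, which can be easier to loop over.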

Answer 3

While parsing HTML with regex is generally a bad idea (I would suggest looking at the DOMDocument class for a better solution), it can be used in cases where you have a VERY specific idea of what you are trying to extract and can be assured that the variable text will never break your regex.

For your case, you might try:

$pattern = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
preg_match_all($pattern, $source, $website_array);

Note the ungreedy modifier (U) at the end. It is very important for matching only the smallest match possible.
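
As a side note, the same effect can be had per quantifier with .*? instead of the pattern-wide modifier. The sketch below compares the two on an assumed two-anchor input (the $source string and variable names are hypothetical):

```php
<?php
// Assumed sample input: two simple anchors.
$source = '<a href="http://url1.com">URL1</a><a href="http://url2.com">URL2</a>';

// Per-quantifier lazy matching: only the quantifiers marked with "?" are lazy.
preg_match_all('#<a href="(.*?)">(.*?)</a>#', $source, $lazy, PREG_SET_ORDER);

// Pattern-wide ungreedy matching with the U modifier.
preg_match_all('#<a href="(.*)">(.*)</a>#U', $source, $ungreedy, PREG_SET_ORDER);

// Both forms return the same two matches for this input.
var_dump($lazy === $ungreedy);
```

One thing to keep in mind is that U inverts greediness, so combining U with .*? makes that quantifier greedy again.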

Answer 4

Alternatively, you can do it like this:

<?php
$html = <<<HTML
<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>
HTML;


$xml = new DOMDocument();
@$xml->loadHTML($html);

$links=array();
$i=0;
//Get all divs
foreach($xml->getElementsByTagName('div') as $divs) {
    //if this div has a class="links"
    if($divs->getAttribute('class')=='links'){
        //loop through the anchors inside this div only
        foreach($divs->getElementsByTagName('a') as $a){
            //if this a tag does not have a class="get-all" continue to next
            if($a->getAttribute('class')!='get-all')
                continue;

            //Assign values to the links array
            $links[$i]['href']=$a->getAttribute('href');
            $links[$i]['value']=$a->nodeValue;
            $i++;
        }

    }
}

print_r($links);
/*
Array
(
    [0] => Array
        (
            [href] => http://url1.com
            [value] => URL1
        )

    [1] => Array
        (
            [href] => http://url2.com
            [value] => This is Url-2
        )

    [2] => Array
        (
            [href] => http://url3.com
            [value] => This is Url-3
        )

    [3] => Array
        (
            [href] => http://url4.com
            [value] => Sweet URL 4
        )

)
*/
?>

4 answers

Answer 1: 6 votes, accepted, 2013-02-19 23:15:57
Answer 2: 2 votes, 2013-02-19 23:16:30
Answer 3: 1 vote, 2013-02-19 23:18:21
Answer 4: 0 votes, 2013-02-19 23:24:39
