
php regex to get string inside href tag

I need a regex that gives me the string inside an href attribute, that is, the part inside the quotes.

For example, I need to extract theurltoget.com from the following:

<a href="theurltoget.com">URL</a>

Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/

Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:

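    // simplexml_load_string() needs well-formed XML, so this only works for XHTML-style markup.
    $xml = simplexml_load_string($myHtml);
    $list = $xml->xpath("//@href");

    $preparedUrls = array();
    foreach($list as $item) {
        // Cast the attribute node to a string before handing it to parse_url().
        $parts = parse_url((string) $item);
        $preparedUrls[] = $parts['scheme'] . '://' .  $parts['host'] . '/';
    }
    print_r($preparedUrls);

$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

// (.+) is greedy, so this pattern is only safe for a single tag whose only attribute is href.
$url = preg_match('/<a href="(.+)">/', $html, $match);

$info = parse_url($match[1]);

echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com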

This expression will handle 3 options:

  1. no quotes
  2. double quotes
  3. single quotes

'/href=["\']?([^"\'>]+)["\']?/'
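As a rough check of the three cases listed above, the sketch below runs the pattern over one made-up string that contains an unquoted, a double-quoted and a single-quoted href. The sample markup and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample markup covering the three quoting styles.
$html = '<a href=plain.html>a</a> <a href="double.html">b</a> <a href=\'single.html\'>c</a>';

// Group 1 holds the URL without its surrounding quotes.
preg_match_all('/href=["\']?([^"\'>]+)["\']?/', $html, $matches);

print_r($matches[1]);
```

Note that the character class also accepts spaces, so an unquoted href followed by further attributes (for example `href=plain.html class=x`) would be captured together with those attributes.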

Use the answer by @Alec if you're only looking for the base URL part (the second part of the question by @David)!

$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);

This will give you:

$info
Array
(
    [scheme] => http
    [host] => www.mydomain.com
    [path] => /page.html" class="myclass" rel="myrel
)

So you can use $href = $info["scheme"] . "://" . $info["host"], which gives you:

// http://www.mydomain.com  

When you are looking for the entire URL inside the href, you should use another regex, for instance the one provided by @user2520237.

$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);

This will give you:

$info
Array
(
    [scheme] => http
    [host] => www.mydomain.com
    [path] => /page.html
)

Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"]; which gives you:

// http://www.mydomain.com/page.html

http://www.the-art-of-web.com/php/parse-links/

Let's start with the simplest case, a well-formatted link with no extra attributes:

/<a href=\"([^\"]*)\">(.*)<\/a>/iU
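A minimal sketch of how this pattern could be used with preg_match_all to collect both the URL and the link text. The sample markup is made up, and the pattern is written in a single-quoted PHP string, where the quotes need no escaping.

```php
<?php
// Hypothetical sample markup: two simple links with no extra attributes.
$html = '<a href="one.html">First</a> <a href="two.html">Second</a>';

// The U modifier makes (.*) lazy, so each match stops at the nearest </a>.
preg_match_all('/<a href="([^"]*)">(.*)<\/a>/iU', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the href value, $m[2] is the link text.
    echo $m[1] . ' => ' . $m[2] . "\n";
}
```

With PREG_SET_ORDER each element of $matches describes one link, which keeps the URL and its text together.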

To replace all href values:

function replaceHref($html, $replaceStr)
{
    // Rewrite each double-quoted href value inside an <a> tag exactly once.
    return preg_replace_callback(
        '/(<a\s[^>]*href=")([^"]+)(")/i',
        function ($m) use ($replaceStr) {
            // $m[2] is the original URL; keep the surrounding markup untouched.
            return $m[1] . $replaceStr . urlencode($m[2]) . $m[3];
        },
        $html
    );
}
// Sample input, reused from the question.
$html        = '<a href="http://www.mydomain.com/page.html">URL</a>';
$replaceStr  = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);

echo $replaceHtml;

This will handle the case where there are no quotes around the URL.

/<a [^>]*href="?([^">]+)"?>/

But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
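One way the DOM route could look, as a sketch rather than a definitive implementation: DOMDocument::loadHTML tolerates markup that is not well-formed XML, and parse_url then splits each href into its parts. The sample markup and the $baseUrls name are assumptions for illustration.

```php
<?php
// Hypothetical sample markup with mixed quoting and extra attributes.
$html = '<p><a href="http://www.mydomain.com/page.html" class="myclass">URL</a>'
      . '<a href=\'https://example.org/a/b?c=1\'>Other</a>'
      . '<a href="relative/page.html">Relative</a></p>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$baseUrls = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $parts = parse_url($a->getAttribute('href'));
    // Skip relative links, which have no scheme or host.
    if (isset($parts['scheme'], $parts['host'])) {
        $baseUrls[] = $parts['scheme'] . '://' . $parts['host'] . '/';
    }
}

print_r($baseUrls);
```

This needs PHP's DOM extension, which is enabled in most installations.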

Because lookbehind and lookahead are cool

/(?<=href=")[^"]+(?=")/

It will match only what you want, without quotation marks

Array ( [0] => theurltoget.com )
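A short sketch of how a lookaround pattern like this behaves once the tag carries more than one attribute. It compares a greedy `.+` body with a `[^"]+` body; the sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup: href followed by another quoted attribute.
$html = '<a href="theurltoget.com" class="x">URL</a>';

// Greedy .+ runs on to the last quote that can still satisfy the lookahead.
preg_match('/(?<=href=").+(?=")/', $html, $greedy);

// [^"]+ cannot cross a quote, so it stops at the end of the href value.
preg_match('/(?<=href=")[^"]+(?=")/', $html, $strict);

echo $greedy[0] . "\n";
echo $strict[0] . "\n";
```

The second form is the safer choice whenever other attributes may follow the href.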

/href="(https?:\/\/[^\/"]*)/

I think you should be able to handle the rest from there.
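For completeness, a minimal sketch of using a pattern of this shape from PHP to get the base URL directly. It assumes a double-quoted href with an http or https scheme, and it uses `#` as the delimiter so the slashes in `://` need no escaping.

```php
<?php
// Sample input taken from the question.
$html = '<a href="http://www.mydomain.com/page.html">URL</a>';

$baseUrl = null;
// Capture the scheme and host: everything up to the first "/" or quote after "://".
if (preg_match('#href="(https?://[^/"]*)#', $html, $match)) {
    $baseUrl = $match[1] . '/';
    echo $baseUrl;
}
```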
