
PHP Regex: Match URL but Don't Match Image

I'm trying to use preg_match_all to extract all URLs from a block of HTML code while ignoring all images.

Example HTML block:

$html = '<p>This is a test</p><br>http://www.facebook.com<br><img src="http://www.google.com/photo.jpg">www.yahoo.com https://www.aol.com<br>';

I'm using the following to try to build an array of URLs only (not images):

if (preg_match_all('~(?:(?:https://)|(?:http://)|(?:www\.))(?![^" ]*(?:jpg|png|gif|"))[^" <>]+~', $html, $links)) {
    print_r($links);
}

In the example above, the $links array should contain:

http://www.facebook.com, www.yahoo.com, https://www.aol.com 

Google is left out because its URL ends in the .jpg image extension. The problem occurs when I add an image like this one to $html:

<img src="http://www.google.com/image%201.jpg">

It seems as though the percent sign causes preg_match_all to break the URL apart and extract the following "link":

http://www.google.com/image 

Any idea how to grab ONLY URLs that are not images, even if they contain special characters that URLs commonly have?
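
To narrow down the cause, here is a minimal reproduction sketch that runs the pattern from the question against two variants of the image tag: one with the encoded `%20` and one with a literal space. The helper name `extractLinks` and the two test strings are assumptions made for illustration. With the pattern exactly as written, the encoded variant should produce no match at all, while the literal space ends the `[^" ]*` run inside the lookahead and reproduces the truncated "link". That suggests the HTML being scanned may contain a decoded space rather than `%20`.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '~(?:(?:https://)|(?:http://)|(?:www\.))(?![^" ]*(?:jpg|png|gif|"))[^" <>]+~';

// Hypothetical helper: return every full match of $pattern in $html.
function extractLinks(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Assumed inputs: the same image tag with an encoded and a literal space.
$encoded = '<img src="http://www.google.com/image%201.jpg">';
$literal = '<img src="http://www.google.com/image 1.jpg">';

// Encoded space: the lookahead still sees ".jpg", so nothing is extracted.
print_r(extractLinks($pattern, $encoded));

// Literal space: the lookahead stops at the space and never reaches ".jpg",
// so "http://www.google.com/image" is extracted.
print_r(extractLinks($pattern, $literal));
```

Either way, the lookahead only inspects characters up to the first space or quote, so the pattern stays fragile for attribute values that contain spaces.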

Using DOM lets you work with the structure of the HTML document, so you can pick out just the parts you want to fetch URLs from.

  1. Load the HTML using DOM
  2. Fetch URLs from link href attributes using XPath (only if you want them, too)
  3. Fetch text nodes from the DOM using XPath
  4. Use a regular expression on each text node value to match URLs

Here is an example implementation:

$html = <<<'HTML'
  <p>This is a test</p>
  <br>
  http://www.facebook.com
  <br>
  <img src="http://www.google.com/photo.jpg">
  www.yahoo.com 
  https://www.aol.com
  <a href="http://www.google.com">Link</a>
  <!-- http://comment.ignored.url -->
  <br>
HTML;

$urls = array();

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// fetch urls from link href attributes
foreach ($xpath->evaluate('//a[@href]/@href') as $href) {
  $urls[] = $href->value;
}

// fetch urls inside text nodes
$pattern = '(
 (?:(?:https?://)|(?:www\.))
 (?:[^"\'\\s]+)
)xS';
foreach ($xpath->evaluate('/html/body//text()') as $text) {
  $matches = array();
  preg_match_all($pattern, $text->nodeValue, $matches);
  foreach ($matches[0] as $href) {
    $urls[] = $href;
  }
}

var_dump($urls);

Output:

array(4) {
  [0]=>
  string(21) "http://www.google.com"
  [1]=>
  string(23) "http://www.facebook.com"
  [2]=>
  string(13) "www.yahoo.com"
  [3]=>
  string(19) "https://www.aol.com"
}
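
The DOM approach skips `<img src>` values, but a link href or a URL in plain text can still point at an image file. One possible follow-up, sketched here under assumptions, is to filter the collected URLs by the extension of their path. The helper name `isImageUrl`, the extension list, and the sample URLs are all hypothetical; `parse_url`, `rawurldecode`, and `pathinfo` are standard PHP functions.

```php
<?php
// Hypothetical helper: true when the URL's path ends in a common image extension.
function isImageUrl(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        // No path component (or an unparsable URL): not treated as an image.
        return false;
    }
    // Decode %20 and similar before reading the extension; compare case-insensitively.
    $extension = strtolower(pathinfo(rawurldecode($path), PATHINFO_EXTENSION));
    return in_array($extension, ['jpg', 'jpeg', 'png', 'gif'], true);
}

// Assumed sample data, shaped like the $urls array built above.
$urls = [
    'http://www.google.com',
    'http://www.google.com/image%201.jpg',
    'www.yahoo.com',
    'https://www.aol.com/logo.PNG?size=2',
];

// Keep only the URLs that do not look like image files.
$pages = array_values(array_filter($urls, function (string $url): bool {
    return !isImageUrl($url);
}));

print_r($pages); // keeps http://www.google.com and www.yahoo.com
```

Checking only the path component means a query string such as `?size=2` does not hide the extension, and percent-encoded characters in the file name do not affect the result.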


 