
Regex expression to find all paths in a HTML string

I have a string containing htmlentities-encoded HTML code.

What I want to do is find all the paths in the document, i.e. the values between the quotes in:

href="XXX" and src="XXX".

I do have a regex that finds all the links starting with http, https, ftp, and file, and lets me iterate over them:

"/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i"

Any idea?

Update: Doing it with a regex isn't reliable. The src=".." or href=".." statement can be part of a comment or a JavaScript statement. To reliably obtain the links, I would suggest using XPath:

<?php

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string/14782594#14782594');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$result = $selector->query('//a/@href | //@src');
foreach($result as $link) {
    echo $link->value, PHP_EOL;
}
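Since the question says the HTML is htmlentities-encoded, the string would presumably need decoding before the parser can see any markup. A minimal sketch of that step, assuming the input was produced by htmlentities() and is UTF-8; the sample markup and the variable names ($encoded, $paths) are made up for illustration:

```php
<?php

// Hypothetical input: HTML that has been passed through htmlentities().
$encoded = htmlentities('<a href="/docs/index.html">Docs</a><img src="img/logo.png">');

// Reverse the encoding so the parser sees real markup again.
$html = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect every href and src attribute value, in document order.
$paths = array();
foreach ($xpath->query('//@href | //@src') as $attr) {
    $paths[] = $attr->value;
}

print_r($paths);
```

Here libxml_use_internal_errors() is used in place of the @ operator; both silence the warnings that loadHTML() emits on imperfect markup.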

If using a regex, I would try to grab the content between the quotes that follow href= or src=. Here is an example of how to get the links from this page using a regex:

<?php

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string');

// Note the U modifier: it makes the pattern ungreedy, so each match
// stops at the first closing quote instead of the last one on the line.
preg_match_all('/href="(?P<href>.*)"|src="(?P<src>.*)"/U', $html, $m);
var_dump($m['href']);
var_dump($m['src']);
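To illustrate the caveat about comments, here is a small sketch comparing the two approaches on the same input. The HTML snippet and the simplified pattern are assumptions made for the demonstration, not part of the original answer:

```php
<?php

// Hypothetical snippet: one real link and one that is commented out.
$html = '<html><body>'
      . '<a href="/real.html">real</a>'
      . '<!-- <a href="/commented-out.html">old</a> -->'
      . '</body></html>';

// Regex approach: [^"]* stops at the closing quote, so no U modifier is needed.
preg_match_all('/\b(?:href|src)="([^"]*)"/i', $html, $m);
$fromRegex = $m[1];

// DOM approach: a comment is not an element, so its contents are skipped.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fromDom = array();
foreach ((new DOMXPath($doc))->query('//@href | //@src') as $attr) {
    $fromDom[] = $attr->value;
}

var_dump($fromRegex); // also contains /commented-out.html
var_dump($fromDom);   // contains only /real.html
```

The regex sees plain text and cannot tell a live attribute from one inside a comment, which is why the XPath route is the safer default.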

You could use the DOM to find all links in specific tags. For example, to get URLs from anchor tags, do something like this (untested, but it should point you in the right direction):

function findPaths($url)
{
   $dom = new DOMDocument();

   // $url is the page to search; the "@" is there to suppress warnings
   @$dom->loadHTMLFile($url);

   $paths = array();
   foreach($dom->getElementsByTagName('a') as $path)
   {
     $paths[] = array('url' => $path->getAttribute('href'), 'text' => $path->nodeValue);
   }
   return $paths;
}

You could make it even easier by using XPath to evaluate the DOM once it is loaded.
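A rough sketch of what that XPath variant might look like. The function name findPathsXPath is hypothetical, and it takes an HTML string rather than a URL so that it can be tried without network access:

```php
<?php

// Hypothetical helper: like findPaths() above, but it takes an HTML string
// and selects the anchors with an XPath query.
function findPathsXPath($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    $paths = array();
    // //a[@href] matches only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $paths[] = array(
            'url'  => $a->getAttribute('href'),
            'text' => trim($a->nodeValue),
        );
    }
    return $paths;
}

print_r(findPathsXPath(
    '<p><a href="/a.html">First</a> <a name="x">no link</a> <a href="b.html"> Second </a></p>'
));
```

The predicate in the query does the filtering that would otherwise need an if inside the loop, so anchors without an href never reach the result.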
