
Regex to print URLs from any webpage with a specific word in the URL

I am using the code below to extract URLs from a webpage, and it works fine, but I want to filter the results. It displays every URL on the page, but I only want the URLs that contain the word "super".

    $regex = '|<a.*?href="(.*?)"|';
    preg_match_all($regex, $result, $parts);
    $links = $parts[1];
    foreach ($links as $link) {
        echo $link . "<br>";
    }

So it should echo only URLs in which the word "super" is present. For example, it should ignore the URL

       http://xyz.com/abc.html  

but it should echo

        http://abc.superpower.com/hddll.html

because the URL contains the required word "super".

Require the word inside the capture group, and restrict the capture to non-quote characters so the match cannot run past the href's closing quote:

    $regex = '|<a\b[^>]*\bhref="([^"]*super[^"]*)"|i';
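If changing the pattern is not desirable, another option is to keep the question's original regex and filter the extracted links afterwards. The following is only a sketch under assumptions: the sample markup in $result is made up from the two URLs in the question, and $needle and $filtered are illustrative names, not part of the original code.

```php
<?php
// Hypothetical sample markup built from the two URLs in the question.
$result = '<a href="http://xyz.com/abc.html">one</a>'
        . '<a href="http://abc.superpower.com/hddll.html">two</a>';

// The question's original pattern: capture every href value.
$regex = '|<a.*?href="(.*?)"|';
preg_match_all($regex, $result, $parts);
$links = $parts[1];

// Keep only the links that contain the required word (assumed name: $needle).
$needle = 'super';
$filtered = array_values(array_filter($links, function ($link) use ($needle) {
    return strpos($link, $needle) !== false;
}));

foreach ($filtered as $link) {
    echo $link . "\n";
}
```

Filtering after extraction keeps the regex simple; swap strpos for stripos if the comparison should be case-insensitive.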

However, to parse and scrape HTML it is better to use PHP's DOM parser.

Update: here is code using the DOM parser:

    $request_url = '1900girls.blogspot.in/';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $request_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $result = curl_exec($ch);
    if ($result === false) {
        die('Request failed: ' . curl_error($ch));
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // suppress warnings from malformed HTML
    $doc->loadHTML($result); // loads your html
    $xpath = new DOMXPath($doc);
    $needle = 'blog'; // word that must appear in the href

    // select every <a> element whose href contains the needle
    $nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
    foreach ($nodelist as $node) {
        echo $node->getAttribute('href') . "\n";
    }
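The XPath expression above splices $needle directly into the query, so a needle containing a single quote would break it. A minimal sketch of one way around that is to collect the anchors with the DOM parser and do the substring test in PHP. The helper name hrefsContaining and the inline sample HTML are assumptions made for illustration; the sample uses the two URLs from the question so it runs without a network request.

```php
<?php
// Hypothetical helper: return every <a href> value that contains $needle.
function hrefsContaining(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Silence warnings from malformed HTML, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Plain substring test, so no XPath quoting is involved.
        if (strpos($href, $needle) !== false) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Sample markup made up from the URLs in the question.
$html = '<html><body>'
      . '<a href="http://xyz.com/abc.html">one</a>'
      . '<a href="http://abc.superpower.com/hddll.html">two</a>'
      . '</body></html>';

foreach (hrefsContaining($html, 'super') as $href) {
    echo $href . "\n";
}
```

In real use, $html would be the $result string returned by curl_exec in the code above.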


 