Get all <a> tag hrefs in a page with PHP
I am trying to get all the external links in a web page and store them in a database. I put the whole page content in a variable:
$pageContent = file_get_contents("http://sample-site.org");
How can I save all the external links?
For example, if the web page contains code such as:
<a href="http://othersite.com">other site</a>
I want to save http://othersite.com in the database. In other words, I want to build a crawler that stores every external link found in a web page. How can I do this?
You could use the find method of PHP Simple HTML DOM Parser:
require_once("simple_html_dom.php");
$pageContent = file_get_html("http://sample-site.org");
foreach ($pageContent->find("a") as $anchor) {
    echo $anchor->href . "<br>";
}
I would suggest using DOMDocument and DOMXPath. With an XPath query you can restrict the result to the external links you asked for (here, any href that starts with "http").
As a note: if you are going to crawl websites, you will more likely want to use cURL, but I will continue with file_get_contents() since that is what you are using in this example. cURL lets you set a user agent and headers, store cookies, and so on, so your requests look more like those of a real user. Some websites try to block robots.
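For reference, a cURL-based fetch might look like the sketch below. The function name fetchPage, the user agent string, and the timeout values are assumptions made for illustration, not part of the original answer, and the cURL extension must be enabled. The example that follows still uses file_get_contents().

```php
<?php
// Hypothetical helper (name and defaults are illustrative assumptions):
// fetch a URL with cURL and return the response body, or null on failure.
function fetchPage(string $url, string $userAgent = 'ExampleLinkCrawler/0.1'): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,              // seconds
        CURLOPT_TIMEOUT        => 30,              // seconds
        CURLOPT_USERAGENT      => $userAgent,      // identify the client
        CURLOPT_HTTPHEADER     => ['Accept: text/html'],
        CURLOPT_COOKIEFILE     => '',              // empty string enables in-memory cookies
    ]);

    $body = curl_exec($ch);

    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    return is_string($body) ? $body : null;
}
```

The returned string can be passed to loadHTML() in place of the file_get_contents() result; a null return means the request failed and should be skipped.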
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xp = new DOMXPath($doc);
// Only pull back A tags with an href attribute starting with "http".
$res = $xp->query('//a[starts-with(@href, "http")]/@href');
if ($res->length > 0)
{
    foreach ($res as $node)
    {
        echo "External Link: " . $node->nodeValue . "\n";
    }
}
else
    echo "There were no external links found.";
/*
 * Output:
 * External Link: http://www.iana.org/domains/example
 */
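One caveat: an href that starts with "http" can still be an absolute link back to the same site. A minimal sketch that compares each link's host against the crawled site's host is shown below; the function name extractExternalLinks and its parameters are assumptions for illustration, and it treats subdomains such as www. as different hosts.

```php
<?php
// Hypothetical helper: return the unique http(s) links in $html whose host
// differs from $baseHost (the host of the page being crawled).
function extractExternalLinks(string $html, string $baseHost): array
{
    // Collect libxml parse warnings internally instead of suppressing with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $links = [];

    foreach ($xp->query('//a[@href]/@href') as $attr) {
        $href  = trim($attr->nodeValue);
        $parts = parse_url($href);

        // Skip malformed URLs and relative links (no scheme or host).
        if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
            continue;
        }
        // Keep only web links; this drops mailto:, javascript:, etc.
        if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
            continue;
        }
        // Same host (case-insensitive) means an internal link.
        if (strcasecmp($parts['host'], $baseHost) === 0) {
            continue;
        }
        $links[] = $href;
    }

    // Remove duplicates and reindex, preserving document order.
    return array_values(array_unique($links));
}
```

For the page in the question, the call would be extractExternalLinks($pageContent, "sample-site.org"), and the resulting array can then be inserted into the database one row per link.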