Get all <a> tag hrefs in a page with PHP

I am trying to get all external links in a web page and store them in a database. I put the whole page's contents in a variable:

$pageContent = file_get_contents("http://sample-site.org");

How can I save all the external links?

For example, if the web page contains code such as:

<a href="http://othersite.com">other site</a>

I want to save http://othersite.com in the database. In other words, I want to make a crawler that stores all external links that exist in a web page. How can I do this?

You could use PHP Simple HTML DOM Parser's find method:

require_once("simple_html_dom.php");
$pageContent = file_get_html("http://sample-site.org");
foreach ($pageContent->find("a") as $anchor)
    echo $anchor->href . "<br>";
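
Since the question asks for external links only, one option is to compare each link's host against the host of the page you crawled. A minimal sketch along those lines, assuming simple_html_dom.php is available and that relative links (no host) should be skipped:

require_once("simple_html_dom.php");

$baseHost = parse_url("http://sample-site.org", PHP_URL_HOST);
$pageContent = file_get_html("http://sample-site.org");

$externalLinks = [];
foreach ($pageContent->find("a") as $anchor) {
    $href = $anchor->href;
    $host = parse_url($href, PHP_URL_HOST);

    // Keep only absolute links whose host differs from the crawled page's host.
    if (!empty($host) && $host !== $baseHost) {
        $externalLinks[] = $href;
    }
}

print_r($externalLinks);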

I would suggest using DOMDocument() and DOMXPath(). This allows the result to contain only external links, as you've requested.

As a note: if you're going to crawl websites, you will most likely want to use cURL, but I will continue with file_get_contents() since that's what you're using in this example. cURL would allow you to do things like set a user agent and headers, store cookies, etc., and appear more like a real user; some websites will attempt to block robots.
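For illustration, here is a hedged sketch of fetching the page with cURL instead of file_get_contents(), setting a browser-like user agent and following redirects (the user-agent string is just an example); the rest of the answer would then work on $html unchanged:

$ch = curl_init("http://example.com");

// Return the response body as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Follow any redirects the server sends.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Present a browser-like user agent; some sites block obvious bots.
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MyCrawler/1.0)");

$html = curl_exec($ch);
curl_close($ch);

if ($html === false)
{
    die("Request failed");
}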

$html = file_get_contents("http://example.com");

$doc = new DOMDocument();
// Suppress warnings from malformed HTML while loading the document.
@$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Only pull back A tags with an href attribute starting with "http".
$res = $xp->query('//a[starts-with(@href, "http")]/@href');

if ($res->length > 0)
{
    foreach ($res as $node)
    {
        echo "External Link: " . $node->nodeValue . "\n";
    }
}
else
{
    echo "There were no external links found.";
}

/*
 * Output:
 *  External Link: http://www.iana.org/domains/example
 */
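
To actually store what you find, rather than echoing it, you could insert each href into your database. A minimal sketch using PDO, assuming a hypothetical MySQL table links(url) and placeholder credentials:

$pdo = new PDO("mysql:host=localhost;dbname=crawler", "user", "password");
$stmt = $pdo->prepare("INSERT INTO links (url) VALUES (:url)");

foreach ($res as $node)
{
    // Insert each discovered link; the prepared statement guards against SQL injection.
    $stmt->execute([":url" => $node->nodeValue]);
}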
