Get the href of all <a> tags in a page using PHP

Question

I am trying to collect all the external links on a web page and store them in a database. I load the entire page content into a variable:

$pageContent = file_get_contents("http://sample-site.org");

How can I save all the external links?

For example, if the web page contains the following code:

<a href="http://othersite.com">other site</a>

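I want to save http://othersite.com in the database. In other words, I want to build a crawler that stores all the external links found on a web page. How should I do this?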

Answer 1

You can use the find method of PHP Simple HTML DOM Parser:

require_once("simple_html_dom.php");
$pageContent = file_get_html("http://sample-site.org");
foreach ($pageContent->find("a") as $anchor)
    echo $anchor->href . "<br>";

Answer 2

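I recommend using DOMDocument() and DOMXPath(). That way the results contain only the external links you asked for.

As a side note: if you are going to crawl websites, you will more likely want to use cURL, but I will stick with file_get_contents() because that is what you are using in this example. With cURL you can set the user agent and headers, store cookies, and so on, so your requests look more like those of a real user. Some websites try to block bots.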
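To make the cURL remark concrete, here is a minimal sketch of such a fetch. The function name fetchPage, the user-agent string, the Accept header, and the cookie file path are all placeholders chosen for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (an illustration only): fetch a page with cURL while
// sending a user agent and a header, and keeping cookies between requests.
function fetchPage(string $url, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)', // placeholder value
        CURLOPT_HTTPHEADER     => ['Accept: text/html'],                          // placeholder header
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on later requests
        CURLOPT_TIMEOUT        => 10,          // give up after 10 seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}
```

The returned string could then be passed to loadHTML() in place of the file_get_contents() result.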

$html = file_get_contents("http://example.com");

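$doc = new DOMDocument();
// Buffer libxml warnings about malformed HTML instead of silencing them with @.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Only pull back A tags with an href attribute starting with "http".
// Note: this also matches absolute links that point back to the same site.
$res = $xp->query('//a[starts-with(@href, "http")]/@href');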

if ($res -> length > 0)
{
    foreach ($res as $node)
    {
        echo "External Link: " . $node -> nodeValue . "\n";
    }
}
else
    echo "There were no external links found.";

/*
 * Output:
 *  External Link: http://www.iana.org/domains/example
 */
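The XPath filter above keeps every absolute link, including absolute links back to the page's own site. If "external" should mean "on a different host", one possible refinement is to compare each link's host with the site's own host using parse_url(). The function name extractExternalLinks and its parameters are hypothetical, and the exact host comparison is a simplifying assumption (for example, www.sample-site.org would count as external to sample-site.org).

```php
<?php
// Hypothetical helper (a sketch, not the original answer's code): return the
// absolute http(s) links in $html whose host differs from $ownHost.
function extractExternalLinks(string $html, string $ownHost): array
{
    $doc = new DOMDocument();
    // Buffer libxml parse warnings rather than printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $links = [];

    foreach ($xp->query('//a[@href]/@href') as $attr) {
        $url    = trim($attr->nodeValue);
        $scheme = parse_url($url, PHP_URL_SCHEME);
        $host   = parse_url($url, PHP_URL_HOST);

        // Skip relative links, protocol-relative links, mailto: and malformed URLs.
        if (!is_string($scheme) || !is_string($host)) {
            continue;
        }
        // Keep only http and https links.
        if (!in_array(strtolower($scheme), ['http', 'https'], true)) {
            continue;
        }
        // Skip links that point back to the same host (case-insensitive).
        if (strcasecmp($host, $ownHost) === 0) {
            continue;
        }
        $links[$url] = true; // use array keys to drop duplicates
    }

    return array_keys($links);
}
```

Called as extractExternalLinks($html, "sample-site.org"), this returns an array of URL strings that could then be inserted into the database one row at a time.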

2 answers

Answer 1
4 2018-05-21 19:13:26

Answer 2
0 2018-05-21 19:28:58
