Would there be any performance benefits to doing this? (PHP)

I'm creating a site spider that grabs all the links from a web page as well as that page's html source code. It then checks all the links it has found and keeps only the internal ones. Next it goes to each of those internal pages and repeats the above process.

Basically its job is to crawl all the pages under a specified domain and grab each page's source. The reason for this is that I want to run some checks to see whether this or that keyword is found on any of the pages, as well as to list each page's meta information.
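
For example, a per-page check might look roughly like this (a minimal sketch; $url, $html and the keyword are placeholders for whatever I end up using):

if (stripos($html, "some keyword") !== false) {   // case-insensitive substring check
    echo "keyword found on $url\n";
}

// list the page's meta tags from the html that was already fetched
$doc = new DOMDocument();
@$doc->loadHTML($html);   // suppress warnings caused by real-world malformed html
foreach ($doc->getElementsByTagName("meta") as $meta) {
    echo $meta->getAttribute("name"), ": ", $meta->getAttribute("content"), "\n";
}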

I would like to know if I should run these checks on the html during the crawling phase of each page, or if I should save all the html in an array, for example, and run all the checks at the end. Which would be better performance-wise?

Seems like you may very well run into memory issues if you try to save all the data (in memory) for processing later. You may be able to use the curl_multi_* functions to process pages efficiently while fetching them.
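
A rough sketch of that pattern (assuming $urls holds the current batch of page URLs, and checkPage() stands in for your own checking logic):

$mh = curl_multi_init();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_multi_add_handle($mh, $ch);
}

do {
    curl_multi_exec($mh, $running);       // drive all transfers forward
    while ($info = curl_multi_info_read($mh)) {
        $ch   = $info['handle'];
        $html = curl_multi_getcontent($ch);
        $url  = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        // checkPage($url, $html);        // process now, so the html can be freed
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_select($mh);               // block until a transfer has activity
} while ($running > 0);

curl_multi_close($mh);

That way each page is checked as soon as its transfer completes, and its html never has to be kept around.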

You should use either phpQuery or QueryPath or one of the alternatives listed here: How do you parse and process HTML/XML in PHP?

This simplifies fetching the pages, as well as extracting the links. Basically you just need something like:

$page = qp("http://example.org/");   // QueryPath

foreach ($page->find("a") as $link) {
    print $link->attr("href");
    // test if local link, then fetch next page ...
}
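
For the "test if local link" part, a plain parse_url() comparison is one way to do it (a sketch; isInternal() is a made-up helper):

// internal = relative link (no host in the href), or the same host as the start page
function isInternal($href, $base) {
    $host = parse_url($href, PHP_URL_HOST);
    return $host === null || $host === parse_url($base, PHP_URL_HOST);
}

var_dump(isInternal("/about.html", "http://example.com/"));          // bool(true)
var_dump(isInternal("http://other.org/x", "http://example.com/"));   // bool(false)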

phpQuery has some more functions which simplify crawling (turning local links into absolute URLs, etc.), but you'll have to consult the documentation. And you might also need a better approach for recursion, maybe a page/url stack to work on:

$pool = array();
$pool[] = "http://example.com/first-url.html";   // to begin with
$visited = array();

while ($url = array_pop($pool)) {
    if (isset($visited[$url])) {
        continue;                  // already crawled; avoids a neverending loop
    }
    $visited[$url] = true;

    // fetch $url and run the checks on its html ...
    // then queue the internal links found on it: $pool[] = ...
}

It's something you shouldn't try to overoptimize. Run it as a standalone script, and process each page individually.
