
PHP crawler takes all the memory from the server

I wrote a really simple PHP crawler, but I have a problem with its memory usage. The code is:

<?php
require_once 'db.php';

$homepage = 'https://example.com';
$query = "SELECT * FROM `crawled_urls`";
$response = @mysqli_query($dbc, $query);

$already_crawled = [];
$crawling = [];

while($row = mysqli_fetch_array($response)){
  $already_crawled[] = $row['crawled_url'];
  $crawling[] = $row['crawled_url'];
}

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

follow_links($homepage);

Can you help me out and share a way to avoid this huge memory consumption? When I start the process everything works fine, but memory usage steadily rises to 100%.

You need to unset $doc when you no longer need it:

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  unset($doc);

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

follow_links($homepage);

Explanation: You are using recursion, which means you are essentially building a stack of function calls. If that stack is 20 calls deep, the resources for all 20 invocations are allocated at once, and the deeper it gets the more memory you use. $doc is the main problem, but look at how your other variables are used too, and make sure nothing unneeded stays allocated when the function calls itself again.

Try to unset the $doc variable before calling the function recursively:

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);
  unset($doc);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

The main problem with your code is that you are using recursion. That way you keep old pages in memory even though you no longer need them.

Try removing that recursion. It should be relatively easy, since you are already using lists to store your links. I would prefer to use a single list and represent URLs as objects, however.
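An iterative, queue-based version could look roughly like this. This is a sketch, not a drop-in replacement: it drops the database bookkeeping to focus on the queue, keeps the visited set in memory, and assumes the same https://example.com base URL as the original code:

```php
<?php
// Iterative (queue-based) crawler sketch: only one DOMDocument
// exists at a time, so memory stays roughly constant.
$base = 'https://example.com';
$queue = [$base];  // URLs still to visit (FIFO queue)
$visited = [];     // URLs already seen, keyed for O(1) lookup

while ($queue) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // skip unreachable pages
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed HTML

    foreach ($doc->getElementsByTagName('a') as $link) {
        $full = $base . $link->getAttribute('href');
        if (!isset($visited[$full])) {
            $queue[] = $full;
            echo $full . PHP_EOL;
        }
    }

    unset($doc); // each page's DOM is released before the next iteration
}
```

Note also the use of isset() on a keyed array instead of in_array(): lookups in a long list of crawled URLs are O(1) instead of O(n).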

Some other things:

  • It looks like you have an SQL injection vulnerability, so learn to use prepared statements correctly
  • Avoid using global variables (you can make your function return a list of links instead)
  • If you plan to run this code against other people's websites, make sure you obey robots.txt, limit your crawl rate, and avoid crawling the same pages multiple times
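On the prepared-statement point: the original code calls mysqli_prepare() but concatenates the URL straight into the SQL string, which defeats the purpose. A correctly parameterised version (assuming the same $dbc connection and crawled_urls table) would be:

```php
// Bind the URL as a parameter instead of concatenating it into the SQL.
$query = 'INSERT INTO `crawled_urls` (`id`, `crawled_url`) VALUES (NULL, ?)';
$stmt = mysqli_prepare($dbc, $query);
mysqli_stmt_bind_param($stmt, 's', $full_link); // 's' = string parameter
mysqli_stmt_execute($stmt);
mysqli_stmt_close($stmt);
```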

If you want to use this code for anything other than education, I suggest using a library. That will be easier than building a crawler from scratch.
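For example, fetching a page and extracting its links with Guzzle and Symfony's DomCrawler component (one possible stack, both installed via Composer; the answer does not name a specific library) might look like:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// Fetch the page over HTTP.
$client = new Client(['base_uri' => 'https://example.com']);
$html = (string) $client->get('/')->getBody();

// Parse it and extract every href attribute on the page.
$crawler = new Crawler($html, 'https://example.com');
$links = $crawler->filter('a')->extract(['href']);

foreach ($links as $href) {
    echo $href . PHP_EOL;
}
```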
