
Best way to manage long-running PHP script?

I have a PHP script that takes a long time (5-30 minutes) to complete. Just in case it matters, the script is using curl to scrape data from another server. This is the reason it's taking so long; it has to wait for each page to load before processing it and moving to the next.

I want to be able to initiate the script and let it run until it's done, at which point it will set a flag in a database table.

What I need to know is how to end the HTTP request before the script has finished running. Also, is a PHP script the best way to do this?

Certainly it can be done with PHP; however, you should NOT do this as a background task - the new process has to be dissociated from the process group where it was initiated.

Since people keep giving the same wrong answer to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html

From the comments:

The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons why are a bit too long to include here.
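
For context, a minimal sketch of that hand-off, assuming a hypothetical job id passed to the worker script on the command line; the paths are illustrative, not from the original answer:

    <?php
    // Hypothetical kick-off page: hand the long job to the "at" daemon so the
    // HTTP request returns immediately. longThing.php is assumed to set a flag
    // in the database when it finishes, as the question requires.
    $jobId = 42; // illustrative parameter read by the worker via $argv
    $cmd = sprintf('echo /usr/bin/php -q /var/www/longThing.php %s | at now',
                   escapeshellarg((string) $jobId));
    shell_exec($cmd);
    echo 'Job queued.';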

Update, 12 years later

While this is still a good way to invoke a long-running bit of code, it is good for security to limit, or even disable, the ability of PHP in the webserver to launch other executables. And since this decouples the behaviour of the long-running thing from that which started it, in many cases it may be more appropriate to use a daemon or a cron job.

The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: don't care what the user does, run this script until it is finished. This is somewhat dangerous on a public-facing site, because you could end up with 20+ copies of the script running at the same time if it is initiated 20 times.
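
One way to limit that danger is a simple lock, so a second request exits instead of starting another copy. This is only a sketch, with an assumed lock-file path and a hypothetical run_long_scrape() standing in for the real work:

    <?php
    // Keep running after the client disconnects, but refuse to run twice at once.
    ignore_user_abort(true);
    set_time_limit(0);

    $lock = fopen('/tmp/scraper.lock', 'c'); // assumed lock-file location
    if (!$lock || !flock($lock, LOCK_EX | LOCK_NB)) {
        exit("Another instance is already running.\n");
    }

    run_long_scrape(); // hypothetical function doing the curl work

    flock($lock, LOCK_UN);
    fclose($lock);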

The "clean" way (at least IMHO) is to set a flag (in the db for example) when you want to initiate the process and run a cronjob every hour (or so) to check if that flag is set. “干净”的方式(至少恕我直言)是在您想要启动进程并每隔一小时(左右)运行一次 cronjob 以检查是否设置了该标志时设置一个标志(例如在数据库中)。 If it IS set, the long running script starts, if it is NOT set, nothin happens.如果设置了,则启动长时间运行的脚本,如果未设置,则不会发生任何事情。

You could use exec or system to start a background job, and then do the work in that.
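
A one-line sketch of that idea (the path is an assumption; redirecting output and appending & lets exec() return immediately instead of waiting for the script to finish):

    <?php
    // Fire and forget: the web request returns while longThing.php keeps working.
    exec('php /path/to/longThing.php > /dev/null 2>&1 &');
    echo 'Job started.';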

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach in Perl would be AnyEvent::HTTP.
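
The answer's example is Perl, but the same "many pages in flight at once" idea can be sketched in PHP with curl_multi; the URLs below are placeholders:

    <?php
    // A rough curl_multi sketch: fetch several pages concurrently instead of
    // waiting for each one in turn.
    $urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until they are finished.
    do {
        curl_multi_exec($mh, $active);
        // Wait for activity on any handle (avoid a tight busy loop).
        if (curl_multi_select($mh) === -1) {
            usleep(100000);
        }
    } while ($active > 0);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        // ... process $html for $url here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);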

ETA: symcbean explained how to detach the background process properly here.

No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.

Yes, you can do it in PHP. But in addition to PHP it would be wise to use a Queue Manager. Here's the strategy:

  1. Break up your large task into smaller tasks. In your case, each task could be loading a single page.

  2. Send each small task to the queue (a minimal publisher sketch follows the option list below).

  3. Run your queue workers somewhere.

Using this strategy has the following advantages:

  1. For long-running tasks it can recover if a fatal problem occurs in the middle of the run -- no need to start from the beginning.

  2. If your tasks do not have to run sequentially, you can run multiple workers to process tasks simultaneously.

You have a variety of options (these are just a few):

  1. RabbitMQ ( https://www.rabbitmq.com/tutorials/tutorial-one-php.html )
  2. ZeroMQ ( http://zeromq.org/bindings:php )
  3. If you're using the Laravel framework, queues are built in ( https://laravel.com/docs/5.4/queues ), with drivers for Amazon SQS, Redis, and Beanstalkd
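
As an illustration of step 2 ("send each small task to the queue"), here is a rough publisher sketch using php-amqplib in the spirit of the linked RabbitMQ tutorial; the queue name and URL list are assumptions, not from the original answer:

    <?php
    require_once __DIR__ . '/vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Message\AMQPMessage;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();
    $channel->queue_declare('scrape_queue', false, true, false, false);

    // One small task per page: workers pick these up one at a time.
    foreach (['http://example.com/page1', 'http://example.com/page2'] as $url) {
        $msg = new AMQPMessage($url, ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]);
        $channel->basic_publish($msg, '', 'scrape_queue');
    }

    $channel->close();
    $connection->close();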

PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written using it. These two qualities, combined with the fact that PHP is "good enough", make a pretty strong case for using it instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, either using AJAX or traditional methods. Your kickoff script will then start the process running in its own session, and return confirmation that the process is going.

Good luck.

I realize this is a quite old question but would like to give it a shot. This script tries both to let the initial kick-off call finish quickly and to chop the heavy load into smaller chunks. I haven't tested this solution.

<?php
/**
 * crawler.php located at http://mysite.com/crawler.php
 */

// Make sure this script keeps running after we close the connection to it.
ignore_user_abort(TRUE);


function get_remote_sources_to_crawl() {
  // Do a database or a log file query here.

  $query_result = array (
    1 => 'http://example.com',
    2 => 'http://example1.com',
    3 => 'http://example2.com',
    4 => 'http://example3.com',
    // ... and so on.
  );

  // Return the first one on the list as array($id, $url).
  foreach ($query_result as $id => $url) {
    return array($id, $url);
  }
  return FALSE;
}

function update_remote_sources_to_crawl($id) {
  // Update my database or log file list so the $id record won't show up
  // on my next call to get_remote_sources_to_crawl().
}

$crawling_source = get_remote_sources_to_crawl();

if ($crawling_source) {
  list($id, $url) = $crawling_source;

  // Run your scraping code on $url here.

  if ($your_scraping_has_finished) {
    // Update your database or log file.
    update_remote_sources_to_crawl($id);

    $ctx = stream_context_create(array(
      'http' => array(
        // I am not quite sure, but I reckon the timeout set here only starts
        // counting once the connection to the remote server is made, limiting
        // how long downloading the remote content may take. As we only want to
        // trigger this script again, 5 seconds should be plenty of time.
        'timeout' => 5,
      )
    ));

    // Open a new connection to this script and close it after 5 seconds.
    file_get_contents('http://' . $_SERVER['HTTP_HOST'] . '/crawler.php', FALSE, $ctx);

    print 'The cronjob kick off has been initiated.';
  }
}
else {
  print 'Yay! The whole thing is done.';
}

I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store in a database a representation of the task with a unique identifier. Then, start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process would update the database representation of the task as it worked, with a completion percentage, current step, or whatever other status indicators you'd like. And when it has finished, it would set a completed flag.
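
A minimal sketch of such a polling endpoint; the table, columns, and parameter name are assumptions for illustration:

    <?php
    // status.php?task_id=123 -- polled by the client app; names are illustrative.
    header('Content-Type: application/json');

    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT percent_complete, current_step, completed FROM tasks WHERE id = ?');
    $stmt->execute([$_GET['task_id']]);
    $task = $stmt->fetch(PDO::FETCH_ASSOC);

    echo json_encode($task ?: ['error' => 'unknown task']);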

You can send it as an XHR (Ajax) request. Clients don't usually have any timeout for XHRs, unlike normal HTTP requests.

I would like to propose a solution that is a little different from symcbean's, mainly because I have an additional requirement: the long-running process needs to run as another user, not as the apache / www-data user.

First solution, using cron to poll a background task table (a minimal worker sketch follows the list):

  • The PHP web page inserts a row into a background task table with state 'SUBMITTED'
  • cron runs once every 3 minutes, as another user, executing a PHP CLI script that checks the background task table for 'SUBMITTED' rows
  • The PHP CLI script updates the state column of the row to 'PROCESSING' and begins processing; after completion the state is updated to 'COMPLETED'
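
A rough sketch of the CLI script that per-user crontab entry could run; the table and column names are assumptions:

    <?php
    // worker.php -- e.g.  */3 * * * *  /usr/bin/php /home/worker/worker.php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'worker', 'secret');

    $row = $pdo->query("SELECT id, params FROM background_task WHERE state = 'SUBMITTED' LIMIT 1")
               ->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        exit; // nothing to do this cycle
    }

    $pdo->prepare("UPDATE background_task SET state = 'PROCESSING' WHERE id = ?")->execute([$row['id']]);

    // ... the long-running work, driven by $row['params'], goes here ...

    $pdo->prepare("UPDATE background_task SET state = 'COMPLETED' WHERE id = ?")->execute([$row['id']]);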

Second solution, using the Linux inotify facility:

  • The PHP web page updates a control file with the parameters set by the user, and also provides a task id
  • A shell script (running as a non-www user) running inotifywait waits for the control file to be written
  • After the control file is written, a close_write event is raised and the shell script continues
  • The shell script executes the PHP CLI to do the long-running process
  • The PHP CLI writes its output to a log file identified by the task id, or alternatively updates progress in a status table
  • The PHP web page can poll the log file (based on the task id) to show the progress of the long-running process, or it can query the status table

Some additional info can be found in my post: http://inventorsparadox.blogspot.co.id/2016/01/long-running-process-in-linux-using-php.html

I have done similar things with Perl: double fork() and detach from the parent process. All HTTP fetching work should be done in the forked process.

Use a proxy to delegate the request.

If you have a long script, divide the page work into tasks using an input parameter for each task (each page then acts like a thread). I.e., if the page has a long processing loop over 100,000 (1 lakh) product_keywords, then instead of looping, write the logic for a single keyword and pass that keyword in from cornjobpage.php (in the following example).

And for the background worker, I think you should try this technique: it lets you call as many pages as you like, and all pages will run at once, independently, without waiting for each page's response (asynchronously).

cornjobpage.php // main page

    <?php

    post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue");
    //post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue2");
    //post_async("http://localhost/projectname/otherpage.php", "Keywordname=anyValue");
    //call as many pages as you like; all pages will run at once, independently,
    //without waiting for each page's response (asynchronously).

    /**
     * Executes a PHP page asynchronously so the current page does not have to
     * wait for it to finish running.
     */
    function post_async($url, $params)
    {
        $post_string = $params;

        $parts = parse_url($url);

        $fp = fsockopen($parts['host'],
            isset($parts['port']) ? $parts['port'] : 80,
            $errno, $errstr, 30);
        if (!$fp) {
            return; // could not open the socket
        }

        // You can use POST instead of GET if you like.
        $out  = "GET " . $parts['path'] . "?$post_string" . " HTTP/1.1\r\n";
        $out .= "Host: " . $parts['host'] . "\r\n";
        $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
        $out .= "Content-Length: " . strlen($post_string) . "\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        fclose($fp);
    }
    ?>

testpage.php

    <?php
    echo $_REQUEST["Keywordname"]; // case 1 output: testValue
    ?>

PS: if you want to send URL parameters in a loop, then follow this answer: https://stackoverflow.com/a/41225209/6295712

Not the best approach, as many have stated here, but this might help:

ignore_user_abort(1); // run script in background even if user closes browser
set_time_limit(1800); // run it for 30 minutes

// Long running script here

If the output your script needs to produce is some processing rather than a web page, then I believe the desired solution is to run your script from the shell, simply as:

php my_script.php

What I ALWAYS use is one of these variants (because different flavors of Linux have different rules about handling output, and some programs produce output differently):

Variant I: @exec('./myscript.php 1>/dev/null 2>/dev/null &');

Variant II: @exec('php -f myscript.php 1>/dev/null 2>/dev/null &');

Variant III: @exec('nohup myscript.php 1>/dev/null 2>/dev/null &');

You might have to install nohup. For example, when I was automating FFMPEG video conversions, the output somehow wasn't 100% handled by redirecting output streams 1 and 2, so I used nohup AND redirected the output.
