
What's wrong with my concurrent programming logic?

I wrote a web spider that crawls pages concurrently. For each link the spider finds, I want to fork off a new child that starts the whole process over again.

I don't want to overload the target server, so I created a static array that all objects can access. Each child adds its PID to the array, and either the parent or the child checks the array to see whether $maxChildren has been reached; if so, it waits patiently until any child finishes.

As you see, I have $maxChildren set to 3, so I expect to see at most 3 simultaneous processes at any given time. However, that's not the case: the Linux top command shows 12 to 30 processes at any given time. In concurrent programming, how can I regulate the number of simultaneous processes? My logic is currently inspired by how Apache handles its max children, but I'm not exactly sure how that works.

As pointed out in one of the answers, globally accessing the static variable raises race-condition issues. To deal with this, the $children array uses the unique PID of the process as both the key and the value, creating a unique entry per process. My thinking is that since any process only ever touches its own $children[$pid] entry, locking is not necessary. Is this not true? Is there a chance that two processes could try to unset or add the same value at some point?

private static $children = array();

private $maxChildren = 3;

public function concurrentSpider($url) {
    // STEP 1:
    // Download the $url (http_get() comes from the pecl_http extension)
    $pageData = http_get($url, $ref = '');

    if (!$this->checkIfSaved($url)) {
        $this->save_link_to_db($url, $pageData);
    }

    // STEP 2:
    // Extract all hyperlinks from this url's page data
    $linksOnThisPage = $this->harvest_links($url, $pageData);

    // STEP 3:
    // Check the links array from STEP 2 to see if each page has
    // already been saved or is excluded because of any other
    // logic from the excluded_link() function
    $filteredLinks = $this->filterLinks($linksOnThisPage);

    shuffle($filteredLinks);

    // STEP 4: loop through each of the links and
    // repeat the process
    foreach ($filteredLinks as $filteredLink) {
        $pid = pcntl_fork();
        switch ($pid) {
            case -1:
                print "Could not fork!\n";
                exit(1);
            case 0:
                // Child: spider the link, which recursively forks
                // children of its own
                if ($this->checkIfSaved($filteredLink)) {
                    exit();
                }
                print "In child with PID: " . getmypid() . " processing $filteredLink \n";

                $this->concurrentSpider($filteredLink);
                sleep(2);

                exit(1);
            default:
                // Parent: add an element to the children array
                self::$children[$pid] = $pid;
                // If the maximum number of children has been
                // reached, wait until one or more return
                // before continuing.
                while (count(self::$children) >= $this->maxChildren) {
                    $pid = pcntl_waitpid(-1, $status);
                    unset(self::$children[$pid]);
                }
        }
    }
}

This is written in PHP. I know that the pcntl_waitpid function with an argument of -1 waits for any child to complete, regardless of which child it is ( http://php.net/manual/en/function.pcntl-waitpid.php ).
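For contrast, here is a minimal, self-contained sketch (not the author's code; $jobs and doWork() are hypothetical stand-ins) of the same fork-and-throttle pattern where only a single parent ever forks. In the spider above, every child also calls concurrentSpider() and forks grandchildren of its own, each with its own copy of self::$children, which is why the cap doesn't hold:

// Minimal sketch: one parent forks workers and throttles them with
// pcntl_waitpid(-1, ...). Children never fork again, so the cap holds.
$maxChildren = 3;
$children = array();

foreach ($jobs as $job) {            // $jobs: hypothetical list of URLs
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork!\n");
    } elseif ($pid === 0) {
        doWork($job);                // hypothetical worker function
        exit(0);                     // child exits without forking
    }
    $children[$pid] = $pid;          // parent records the child
    while (count($children) >= $maxChildren) {
        $done = pcntl_waitpid(-1, $status);   // block for any child
        unset($children[$done]);
    }
}
while (($done = pcntl_waitpid(-1, $status)) > 0) {
    unset($children[$done]);         // reap the stragglers
}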

What's wrong with my logic, and how can I correct it so that only $maxChildren processes are running simultaneously? I'm also open to improving the logic in general if you have suggestions.

First thing to note: if this is truly a global being shared among multiple threads, it's possible that multiple threads are adding to it at once and you're running afoul of a race condition. You need some sort of concurrency control to ensure that only one process accesses your global array at a time.
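To illustrate the kind of concurrency control meant here: if the PID list lived somewhere genuinely shared between processes, say a file, an advisory lock via PHP's flock() would serialize access to it. A sketch, with the path /tmp/spider_children chosen arbitrarily:

// Sketch: cross-process mutual exclusion with an advisory file lock.
// Assumes the shared state is a serialized array kept in a file,
// rather than a per-process static variable.
$fp = fopen('/tmp/spider_children', 'c+');
if (flock($fp, LOCK_EX)) {                // block until we hold the lock
    $raw = stream_get_contents($fp);
    $children = ($raw === '') ? array() : unserialize($raw);
    $children[getmypid()] = getmypid();   // critical section
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, serialize($children));
    fflush($fp);
    flock($fp, LOCK_UN);                  // release
}
fclose($fp);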

Also, try the simple debugging trick of having each process write out (to the console or to a file) its PID and the full contents of the global array each time a new spider is forked. It will help you check your assumptions (which are plainly wrong at some point) and figure out what's going wrong.
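In the code above, that could be as little as one hypothetical line placed just before pcntl_fork():

// Hypothetical debug line: log this process's PID and its view
// of the children array each time it is about to fork.
error_log(sprintf(
    "[pid %d] children = %s",
    getmypid(),
    implode(',', array_keys(self::$children))
));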

EDIT: (In response to the comments)

I'm not a PHP developer, but based on the fact that you're using an OS tool that counts OS-level processes, I'd guess that your fork is spawning multiple processes, but your static array is global only within the current process. Implementing system-wide shared memory is a lot more complicated!
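This guess is easy to verify: after pcntl_fork(), the child receives a copy of the parent's memory, so a write in one process is invisible to the other. A minimal sketch:

// Sketch: demonstrates that an array is copied, not shared, across fork().
$children = array();
$pid = pcntl_fork();
if ($pid === 0) {
    // Child: mutate its own copy of the array, then exit.
    $children['from-child'] = getmypid();
    exit(0);
}
pcntl_waitpid($pid, $status);
// Parent: the child's write never happened here.
var_dump($children);   // prints: array(0) {}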

If you just want to count something and ensure that instances of a shared resource don't grow out of control, look into semaphores, and see if you can find a way in PHP to create a named semaphore object that can be shared between multiple instances of your spider.
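PHP does ship such a facility on POSIX systems: the System V semaphore functions sem_get(), sem_acquire(), and sem_release() (available when PHP is compiled with --enable-sysvsem). The semaphore is identified by a system-wide integer key, so even unrelated processes can share it. A sketch, where the key 0xC0FFEE and the worker body are illustrative:

// Sketch: limit concurrency across processes with a SysV semaphore.
// sem_get() with max_acquire = 3 lets at most 3 holders in at once,
// no matter how the processes are related.
$sem = sem_get(0xC0FFEE, 3);   // 0xC0FFEE is an arbitrary IPC key

for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        sem_acquire($sem);     // blocks while 3 spiders already hold it
        sleep(1);              // stand-in for crawling one page
        sem_release($sem);
        exit(0);
    }
}
while (pcntl_waitpid(-1, $status) > 0) {
    // reap all children
}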

Use a real programming language ;)

Step 1 is kind of bad: why are you downloading the page if it might already be in the db? Put the download inside the if, and see if you can put a mutex around it. Maybe do something in SQL to imitate one.
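One way to imitate a mutex in SQL is MySQL's named lock functions GET_LOCK() and RELEASE_LOCK(). A sketch, assuming a PDO connection in $pdo and reusing the question's helper methods:

// Sketch: a named, server-side mutex via MySQL's GET_LOCK(), so two
// processes can't download the same page at once. MySQL-specific.
$name = 'spider:' . md5($url);                 // lock name derived from URL
$got = $pdo->query(
    "SELECT GET_LOCK(" . $pdo->quote($name) . ", 10)"   // wait up to 10 s
)->fetchColumn();

if ($got == 1) {
    if (!$this->checkIfSaved($url)) {
        $pageData = http_get($url, '');        // download only under the lock
        $this->save_link_to_db($url, $pageData);
    }
    $pdo->query("SELECT RELEASE_LOCK(" . $pdo->quote($name) . ")");
}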

I hope harvest_links uses a proper HTML processor with CSS selector support (I like Fizzler for .NET). I guess a regular expression would be fine if it's just to get links, but it's easy to mess up.
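In PHP, the bundled DOM extension can harvest links without regular expressions. A minimal sketch (XPath rather than CSS selectors, but the same idea):

// Sketch: extract href attributes with PHP's built-in DOM extension.
function extractLinks($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // @ suppresses warnings on messy HTML
    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}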

I see step 4, and I don't think it's bad, but personally I'd do it a different way. I'd have something like step 1 insert the url, page, and a flag into a db. Then I'd have another process (or the same one) ask the db for unprocessed pages, and set the flag to one value on error and another on success. That way, if the process exits for any reason (shutdown, crash, power outage, etc.), it can pick up where it left off easily and doesn't need to re-scan every page to find out. It just asks the database for the next link and redoes whatever it didn't finish.
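A sketch of that claim-and-flag pattern, assuming MySQL via PDO and a hypothetical links table with a status column ('pending', 'working', 'done', 'error'):

// Sketch: claim the next unprocessed URL atomically, then flag the outcome.
// Assumed schema: CREATE TABLE links (
//     url VARCHAR(255) PRIMARY KEY,
//     status ENUM('pending','working','done','error') DEFAULT 'pending'
// );
function claimNextUrl(PDO $pdo) {
    $pdo->beginTransaction();
    $row = $pdo->query(
        "SELECT url FROM links WHERE status = 'pending' LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);
    if ($row) {
        $stmt = $pdo->prepare("UPDATE links SET status = 'working' WHERE url = ?");
        $stmt->execute(array($row['url']));
    }
    $pdo->commit();
    return $row ? $row['url'] : null;
}

function finishUrl(PDO $pdo, $url, $ok) {
    $stmt = $pdo->prepare("UPDATE links SET status = ? WHERE url = ?");
    $stmt->execute(array($ok ? 'done' : 'error', $url));
}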

PHP doesn't support multithreading, therefore it doesn't support mutexes or any other synchronization methods. As others have said in their answers, this will lead to a race condition.

You'll have to write a wrapper in C or bash. That way, the PHP script can submit targets to the wrapper, and the wrapper will handle scheduling.

Another approach is to rewrite your spider in Python or Ruby, both of which support multithreading. That will eliminate the need for interprocess communication.

Edit: On second thought, the best way is to write the wrapper in Python or Ruby and reuse your existing PHP code as a black box. That's a compromise between the solutions above.

If the spider is for practical purposes, you might want to google "curl multithread".

cURL Multi Threading with PHP
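That search leads to PHP's built-in curl_multi_* API, which downloads many URLs concurrently inside a single process, with no forking at all. A minimal sketch with placeholder URLs:

// Sketch: fetch several URLs concurrently in one process with curl_multi.
$urls = array('http://example.com/a', 'http://example.com/b');
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until none are still running.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);       // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $pageData = curl_multi_getcontent($ch);
    // ... save/parse $pageData for $url ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);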
