
Crawling, scraping and threading with PHP?

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...

The way it works is that a cron job runs a .php script once every minute, which crawls the next blog in the DB. The results go into the DB, and then a second .php script crawls the collected links.

The scripts only crawl two levels deep: the main page (www.url.com) and the links on that page (www.url.com/post1, www.url.com/post2).

My problem is that as the collection of blogs grows, each one is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, since only one is processed every minute.

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links, because of script execution times, memory limits, timeouts, etc.

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

What is the best way I could speed this process up?

Is there a way I can have multiple scripts writing to the DB, written so they do not overwrite each other but instead queue their results?

Is there some way to create threading in PHP so that a script can process links at its own pace?

Any ideas?

Thanks.

Pseudo-code for running parallel scanners, sketched here as PHP with PDO (any DB layer would do; the scan_targets table is assumed to have id, url, being_scanned and scanned_at columns, and scan_target() is your existing crawl routine):

function start_a_scan(PDO $db)
{
    // Start a MySQL transaction (needs InnoDB, AFAIK)
    $db->beginTransaction();
    // Get the first entry that has timed out and is not being scanned by anyone,
    // acquiring an exclusive lock on the affected row
    $row = $db->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = FALSE
           AND scanned_at < NOW() - INTERVAL 60 SECOND
         ORDER BY scanned_at ASC
         LIMIT 1
         FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {           // nothing is due for a scan right now
        $db->rollBack();
        return;
    }
    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare("UPDATE scan_targets SET being_scanned = TRUE WHERE id = ?")
       ->execute(array($row['id']));
    // Commit the transaction (and release the row lock)
    $db->commit();
    // Scan
    scan_target($row['url']);
    // Update the entry's state so it can be scanned again in the future
    $db->prepare("UPDATE scan_targets SET being_scanned = FALSE, scanned_at = NOW() WHERE id = ?")
       ->execute(array($row['id']));
}

You'd probably also need a 'cleaner' that periodically checks whether any aborted scans are hanging around and resets their state so they can be scanned again.
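One possible shape for that cleaner; started_at is a hypothetical extra column that start_a_scan() would set to NOW() when it flags a row, and the 10-minute ceiling is just an assumption about how long a legitimate scan should take:

function reset_aborted_scans(PDO $db)
{
    // Any row still flagged as being scanned long after it started is assumed
    // to belong to a crashed or aborted worker; clear the flag so another
    // worker can pick the target up again.
    $db->exec(
        "UPDATE scan_targets
         SET being_scanned = FALSE
         WHERE being_scanned = TRUE
           AND started_at < NOW() - INTERVAL 10 MINUTE"
    );
}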

And then you can have several scan processes running in parallel! Yey!

cheers!

EDIT: I forgot that the first SELECT needs to be done with FOR UPDATE. Read more here

USE CURL MULTI!

curl_multi will let you process the pages in parallel.

http://us3.php.net/curl

Most of the time is spent waiting on the websites; the DB insertions and HTML parsing are orders of magnitude faster.

You create a list of the blogs you want to scrape, send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
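A rough sketch of that first pass, assuming the curl extension is available; fetch_all() and the option values are illustrative, not a fixed recipe:

function fetch_all(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers at once, sleeping on select() instead of busy-looping
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);
    // Everything is back: collect the bodies serially for parsing/DB insertion
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

The second pass is the same call again, this time fed with the post URLs pulled out of the front pages.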

http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/

This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open-source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc. I use it myself.

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links, because of script execution times, memory limits, timeouts, etc.

The memory limit is only a problem if your code leaks memory. You should fix that, rather than raising the memory limit. The script execution time is a security measure, which you can simply disable for your CLI scripts.
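A small sketch of both points, assuming the crawler runs through php-cli; fetch_page() and store_mp3_links() are hypothetical helpers standing in for your existing fetch and parse/insert code:

// Make the 'no execution time limit' explicit (php-cli already defaults to 0,
// and this also covers the script ever being run via a web SAPI).
set_time_limit(0);

foreach ($links as $url) {
    $html = fetch_page($url);       // hypothetical fetch helper
    store_mp3_links($html);         // hypothetical parse + DB-insert step
    unset($html);                   // drop the buffer instead of letting it pile up
}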

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site, e.g. start a separate script for each site you want to crawl.
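A sketch of what that partitioning could look like; crawl.php, crawl_site() and the cron lines are hypothetical:

// Hypothetical crawl.php: each cron entry passes a different site id, so no
// two instances ever work on the same rows.
//
//   * * * * * php /path/to/crawl.php 1
//   * * * * * php /path/to/crawl.php 2
$siteId = isset($argv[1]) ? (int) $argv[1] : 0;
if ($siteId === 0) {
    fwrite(STDERR, "usage: php crawl.php <site-id>\n");
    exit(1);
}
crawl_site($siteId);   // assumed to touch only this site's rows in the DB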

CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
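For instance, inside whatever loop walks the collected links, a fetch that times out can just be logged and skipped rather than killing the whole run (a sketch; the 15-second timeout and process_page() are assumptions):

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // don't let one slow blog hang the run
$html = curl_exec($ch);
if ($html === false) {
    // Timeout or other cURL error: record it and move on to the next link.
    error_log("fetch failed for $url: " . curl_error($ch));
} else {
    process_page($url, $html);                   // hypothetical parsing/DB step
}
curl_close($ch);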

It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)
