
curl php scraping through cron job every minute on Shared hosting

I have a tricky problem. I am on basic shared hosting, and I have created a good scraping script using cURL and PHP.

Because multi-threading with cURL is not really multi-threading, and even the best cURL multi-threading scripts I have used only speed scraping up by a factor of 1.5-2, I came to the conclusion that I need to run a massive number of cron tasks (like 50) per minute against my PHP script, which interacts with a MySQL table, in order to offer fast web scraping to my customers.
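
For reference, here is a minimal curl_multi sketch of what I mean by "multi-threading" with cURL: several URLs downloaded in parallel inside one PHP process. The URL list and options are placeholders, not my real script.

<?php
// Minimal curl_multi sketch: fetch several URLs in parallel in ONE process.
$urls = ["http://example.com/page1", "http://example.com/page2"]; // placeholders
$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch); // response body of each finished request
    // ...parse $html here and queue the rows for one batched INSERT afterwards
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>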

My problem is that I get a "MySQL server has gone away" error when lots of cron tasks run at the same time. If I decrease the number of cron tasks, it keeps working, but it is always slow.
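
For context, "MySQL server has gone away" usually means the server closed the connection, typically because it sat idle longer than wait_timeout while a page was downloading, or because a packet exceeded max_allowed_packet. Below is a minimal sketch of a reconnect guard, assuming a procedural mysqli link like the rest of my script; the credentials are placeholders.

<?php
// Re-open the link if the server dropped it while the script was busy scraping.
function ensure_link($link) {
    // mysqli_ping() returns false once the server has closed the connection
    if (!($link instanceof mysqli) || !@mysqli_ping($link)) {
        $link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db"); // placeholder credentials
    }
    return $link;
}

// call this before every INSERT that follows a long download:
$link = ensure_link($link);
?>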

I have also tried a browser-based solution that reloads the script every time the while loop finishes. It works better, but the problem is the same: when I run the script 10 times at the same time, it starts to overload the MySQL server or the web server (I don't know which).

To resolve this, I have acquired a MySQL server where I can set my.cnf ... but the problem stays approximately the same.
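
For reference, these are the my.cnf variables usually involved in that error. The sketch below only names them via SET GLOBAL; the values are illustrative, not tuned recommendations, and the same names can go under [mysqld] in my.cnf.

<?php
// Placeholder credentials; SET GLOBAL needs a privileged account.
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db");
mysqli_query($link, "SET GLOBAL max_allowed_packet = 67108864"); // 64M; oversized packets also raise "gone away"
mysqli_query($link, "SET GLOBAL wait_timeout = 600");            // seconds an idle connection is kept open
mysqli_query($link, "SET GLOBAL max_connections = 200");         // 50 parallel cron jobs need at least 50 connections
?>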

========= MY QUESTION IS: where is the problem coming from? The table size? Do I need a big 100 Mbps dedicated server? If yes, are you sure it will resolve the problem, and how fast will it be? Keep in mind that I want the extraction speed to reach approximately 100 URLs per second (right now it is 1 URL per 15 seconds, which is incredibly slow...)

  • There is only one while loop in the script. It loads the whole page, extracts data with preg_match or the DOM, and inserts it into the MySQL database.

  • I extract lots of data, which is why a table quickly grows to millions of rows... When I remove them it maybe runs a bit faster, but the core problem stays the same: it is impossible to run massive numbers of tasks in parallel to accelerate the process.

  • I don't think the problem comes from my script. In any case, even perfectly optimized, it will not go as fast as I want.

  • I tested the script without proxies for scraping, but the difference is very small... not significant.

My conclusion is that I need to use a dedicated server, but I don't want to invest something like $100 per month if I am not sure it will resolve the problem and let me run these massive numbers of cron tasks / calls against the MySQL database without issues.

It's simple... never run multiple parallel requests against the same URL; spread them over many different URLs instead. And try to respect a certain delay between requests. You can do that with:

$random = rand(15, 35); // in seconds
sleep($random);

I would have to see the code, but essentially it does look like you are being rate limited by your host.

Is it possible to run your cron once every minute or two, but batch the scrapes onto one SQL connection in your script?

Essentially, the goal would be to open the SQL socket once and run multiple URL scrapes on that connection, versus your current one scrape per MySQL connection, hopefully avoiding the rate limiting by your host.

Pseudo-code:

<?php
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db");
$sql = "SELECT url FROM urls_table WHERE scraped='0' LIMIT 100";
$result = mysqli_query($link, $sql);
while($row = mysqli_fetch_array($result, MYSQLI_NUM)){
    $url_to_scrape = $row[0];
    //TODO: your scrape code goes here
}
//Only AFTER you've scraped multiple URLs do we close the connection
//this will drastically reduce the number of SQL connects and should help
mysqli_close($link);
?>
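
As a possible follow-up, still inside the while loop above and before mysqli_close(), the results can be written back on the same $link, so one cron run means one connection for the whole batch. This is only a sketch: results_table, its columns, and $extracted_data are made-up names, not from the question.

// Reuse the same $link from the loop above: store the result, flag the URL as done.
$insert = mysqli_prepare($link, "INSERT INTO results_table (url, data) VALUES (?, ?)");
$flag   = mysqli_prepare($link, "UPDATE urls_table SET scraped='1' WHERE url=?");

mysqli_stmt_bind_param($insert, "ss", $url_to_scrape, $extracted_data);
mysqli_stmt_execute($insert);

mysqli_stmt_bind_param($flag, "s", $url_to_scrape);
mysqli_stmt_execute($flag);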
