
curl php scraping through cron job every minute on Shared hosting

I have a tricky problem. I am on basic shared hosting. I have created a good scraping script using curl and PHP.

Because multi-threading with curl is not true multi-threading, and even the best curl multi-threading scripts I have used only speed up scraping by a factor of 1.5-2, I came to the conclusion that I need to run a massive number of cron tasks (like 50) per minute on my PHP script, which interacts with a MySQL table, in order to offer fast web scraping to my customers.
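For reference, this is the curl_multi pattern I am talking about; a minimal sketch where the URL list, batch size and timeout are placeholder assumptions, not my real script:

<?php
// Fetch several URLs concurrently in one PHP process with curl_multi.
$urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all handles until every transfer has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch); // raw page body, ready for preg_match / DOM parsing
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>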

My problem is that I get a "MySQL server has gone away" error when lots of cron tasks are running at the same time. If I decrease the number of cron tasks, it keeps working, but it is always slow.
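For context, a sketch of one workaround I could try, assuming the error comes from the connection being dropped while a long scrape runs: check the link and reconnect before inserting. The credentials here are placeholders.

<?php
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db"); // placeholder credentials

// ... long curl scrape happens here ...

if (!mysqli_ping($link)) {
    // the server may have closed the idle connection during the scrape
    mysqli_close($link);
    $link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db");
}
// now safe to INSERT the scraped data
?>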

I have also tried a browser-based solution by reloading the script every time the while loop finishes. It works better, but always the same problem: when I decide to run the script 10 times at the same time, it begins to overload the MySQL server or the web server (I don't know which).

To resolve this, I have acquired a MySQL server where I can set my.cnf... but the problem stays approximately the same.

========= MY QUESTION IS: WHERE IS THE PROBLEM COMING FROM? ========= TABLE SIZE? DO I NEED A BIG 100MBPS DEDICATED SERVER? IF YES, ARE YOU SURE IT WILL RESOLVE THE PROBLEM, AND HOW FAST WILL IT BE? KNOWING THAT I WANT THE EXTRACTION SPEED TO REACH APPROXIMATELY 100 URLS PER SECOND (at this time, it is 1 URL every 15 seconds, incredibly slow...)

  • There is only one while loop in the script. It loads the whole page, extracts the data with preg_match or the DOM, and inserts it into the MySQL database (a batched-insert sketch follows this list).

  • I extract lots of data, which is why a table quickly contains millions of entries... but when I remove them, it maybe goes a bit faster, yet it is always the same problem: it is impossible to run tasks massively in parallel in order to accelerate the process.

  • I don't think the problem is coming from my script. In any case, even perfectly optimized, it will not go as fast as I want.

  • I tested the script with and without proxies for scraping, but the difference is very small... not significant.
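As mentioned in the first bullet above, here is a minimal sketch of what batching the INSERTs could look like. The table name, columns and the $scraped_items array are hypothetical, and the credentials are placeholders; the idea is one multi-row INSERT per batch instead of one query per scraped item.

<?php
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db"); // placeholder credentials

// Collect the scraped rows in memory, then insert them in a single statement.
$rows = [];
foreach ($scraped_items as $item) { // $scraped_items: hypothetical array built by the scrape loop
    $rows[] = sprintf("('%s', '%s')",
        mysqli_real_escape_string($link, $item['url']),
        mysqli_real_escape_string($link, $item['data']));
}
if ($rows) {
    $sql = "INSERT INTO scraped_data (url, data) VALUES " . implode(",", $rows);
    mysqli_query($link, $sql);
}
?>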

My conclusion is that I need to use a dedicated server, but I don't want to invest something like $100 per month if I am not sure it will resolve the problem and that I will be able to run these massive amounts of cron tasks / calls on the MySQL db without problems.

It's quite simple... never run multiple threads against the same URL; many different URLs are fine. But try to respect a certain delay between requests. You can do that with:

$random = rand(15, 35); // in seconds
sleep($random);

I would have to see the code, but essentially it does look like you are being rate-limited by your host.

Is it possible to run your cron once every minute or two, but batch the scrapes onto one SQL connection in your script?

Essentially, the goal would be to open the SQL socket once and run multiple URL scrapes on that connection, versus your current one scrape per MySQL connection, hopefully avoiding the rate limiting by your host.

Pseudo-code:

<?php
// Open ONE connection and reuse it for the whole batch of URLs.
$link = mysqli_connect("127.0.0.1", "my_user", "my_password", "my_db");
if (!$link) {
    die("Connect failed: " . mysqli_connect_error());
}

// Grab a batch of URLs that have not been scraped yet.
$sql = "SELECT url FROM urls_table WHERE scraped='0' LIMIT 100";
$result = mysqli_query($link, $sql);

while ($row = mysqli_fetch_array($result, MYSQLI_NUM)) {
    $url_to_scrape = $row[0];
    //TODO: your scrape code goes here (curl fetch, parse, INSERT results, mark as scraped)
}

//Only AFTER you've scraped multiple URLs do we close the connection.
//This will drastically reduce the number of SQL connects and should help.
mysqli_close($link);
?>
