I'm trying to index many hundreds of web pages.
In development everything worked fine, but once I started indexing far more than a few test pages, cURL stopped working after a few runs. It no longer gets any data from the remote server.
These are the errors cURL printed (not all at once, of course):
I'm working on a virtual server (V-Server) and also tried connecting to the remote server with Firefox and with wget; nothing got through either. But when I connect to that remote server from my local machine, everything works fine.
After waiting a few hours, it works again for a few runs.
To me this looks like a problem on the remote server, or some kind of DDoS protection. What do you think?
How often does the script run? It could well be triggering some DoS-like protection. I would recommend adding a random delay between requests so they appear more "natural".
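A minimal sketch of such a random delay in PHP, assuming the pages are fetched in a loop; the 2–10 second bounds are arbitrary example values, not a recommendation from the thread:

```php
<?php
// Sketch: pick a random pause between fetches so the request
// pattern looks less mechanical. The 2-10 s bounds are arbitrary.
function randomDelaySeconds(int $min = 2, int $max = 10): int
{
    return rand($min, $max);
}

// Inside your crawl loop, before each request, you would call:
//   sleep(randomDelaySeconds());

$delay = randomDelaySeconds();
echo "sleeping {$delay}s before the next request\n";
```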
You should use proxies when you send out that many requests, since the site's DDoS protection (or a similar setup) can block your IP.
Here are some things to note (what I used when scraping data from websites):
1. Use proxies.
2. Use random user agents.
3. Use random referers.
4. Add a random delay between cron runs.
5. Add a random delay between requests.
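The rotation steps above can be sketched with PHP's cURL options; all of the user agents, referers, and the target URL below are placeholder values, and the proxy list is left empty for you to fill in:

```php
<?php
// Sketch: rotate user agent, referer and (optionally) a proxy per request.
// Every list entry below is a placeholder; substitute your own values.
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
];
$referers = ['https://www.google.com/', 'https://www.bing.com/'];
$proxies  = [];  // e.g. 'host:port' entries; left empty in this sketch

$ua      = $userAgents[array_rand($userAgents)];
$referer = $referers[array_rand($referers)];

$ch = curl_init('https://example.com/page');   // placeholder target URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
curl_setopt($ch, CURLOPT_REFERER, $referer);
if ($proxies !== []) {
    curl_setopt($ch, CURLOPT_PROXY, $proxies[array_rand($proxies)]);
}
// $html = curl_exec($ch);  // enable when you are ready to fetch
curl_close($ch);
```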
What I would do is make the script run forever and add a sleep between requests:
ignore_user_abort(1);
set_time_limit(0);
Just trigger it by visiting the URL for a second, and it will keep running forever.
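Putting the two calls above together with a paced loop might look like this sketch; the URL queue is a placeholder, and the bounded `while` stands in for the endless run described above:

```php
<?php
// Sketch: keep the script alive after the visitor disconnects and
// loop over the work with a random pause between fetches.
ignore_user_abort(true);   // keep running if the visitor closes the page
set_time_limit(0);         // lift PHP's execution time limit

$queue   = ['https://example.com/a', 'https://example.com/b'];  // placeholder URLs
$fetched = 0;

while ($queue !== []) {      // use while (true) with a refilled queue for an endless run
    $url = array_shift($queue);
    // ... fetch and index $url with cURL here ...
    $fetched++;
    usleep(rand(500000, 1500000));  // 0.5-1.5 s random pause (tune as needed)
}
```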