
Blocked from scraping a website with Scrapy?

I'm still trying to scrape search results from this kind of URL, which is the search results page of a Chinese online newspaper. Scrapy works for a few requests, and then I get the following terminal output:

2019-12-19 11:56:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <461 http://so.news.cn/getNews?keyword=%E7%BE%8E%E5%9B%BD&curPage=55&sortField=0&searchFields=0&lang=cn >: HTTP status code is not handled or not allowed

It seems to work better if I add a delay, but then it is very slow. Is this because I am being blocked by the site, and is there anything I can do about it? I don't currently have any special User-Agent defined in settings.py. I have tried using scrapy-UserAgent to rotate the User-Agent, but it doesn't seem to be working. Would a VPN help?

Thanks

Different solutions to try:

  • Add a random pause between requests
  • Make good use of sessions:

    1) Keep the same session for a number of requests (30 to 60).

    2) Clear your cookies after 30 to 60 requests and change the user agent. This simple Python package can help: https://pypi.org/project/shadow-useragent/

    3) If that still does not work, rotate your IP over time (every 30 to 60 requests, for instance) through a proxy provider, and rotate your user agent and clear your cookies at the same time.
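A rough sketch of how the random pause and the session handling could look in Scrapy. Everything named here is an assumption on my side: the `USER_AGENTS` pool, the `SessionRotationMiddleware` class and the `THROTTLE_SETTINGS` dict are made up for illustration. The only Scrapy features it relies on are the `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` settings and the `cookiejar` meta key read by the built-in cookies middleware.

```python
import random

# Hypothetical pool of user agents; swap in current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Values to copy into settings.py (or a spider's custom_settings).
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,               # base pause in seconds (a guess, tune it)
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # wait 0.5x to 1.5x of DOWNLOAD_DELAY
    "COOKIES_ENABLED": True,           # needed for the cookiejar trick below
}


class SessionRotationMiddleware:
    """Downloader middleware: reuse one user agent and one cookie jar for
    30 to 60 requests, then start over with a fresh pair."""

    def __init__(self, user_agents=USER_AGENTS, min_requests=30, max_requests=60):
        self.user_agents = list(user_agents)
        self.min_requests = min_requests
        self.max_requests = max_requests
        self.session_id = 0
        self._start_session()

    def _start_session(self):
        # A new session means: new jar id, new user agent, new request budget.
        self.session_id += 1
        self.user_agent = random.choice(self.user_agents)
        self.remaining = random.randint(self.min_requests, self.max_requests)

    def process_request(self, request, spider):
        if self.remaining <= 0:
            self._start_session()
        self.remaining -= 1
        request.headers["User-Agent"] = self.user_agent
        # Scrapy keeps a separate cookie jar per "cookiejar" value, so a new
        # id behaves like cleared cookies.
        request.meta["cookiejar"] = self.session_id
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` with a priority below 700 (for example 543), so that it runs before Scrapy's own cookies middleware; the module path depends on your project layout.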

Your requests should now look random enough for most websites. If the site uses further bot mitigation (reCAPTCHAs) or a specialized anti-scraping service, this could get trickier.

In addition to what was already said, I'd add that the right proxy service provider is crucial here.

Not only do you have to rotate proxies really often, but their success rates have to be high as well, so in your case I'd go with residential IPs, which closely resemble real users.

Not to promote any one in particular, but you should look into providers such as Luminati, Oxylabs, GeoSurf, etc.
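For the rotation itself, here is a minimal sketch of a downloader middleware that switches proxy every N requests. The proxy URLs are placeholders (your provider will give you real endpoints and credentials), and the `RotatingProxyMiddleware` name is made up; the only Scrapy behaviour it assumes is that the built-in `HttpProxyMiddleware` honours `request.meta["proxy"]`.

```python
import itertools

# Placeholder endpoints; replace with the ones from your proxy provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


class RotatingProxyMiddleware:
    """Downloader middleware: send a fixed number of requests through one
    proxy, then move on to the next one in the list (round robin)."""

    def __init__(self, proxies=PROXIES, requests_per_proxy=30):
        self._cycle = itertools.cycle(list(proxies))
        self.requests_per_proxy = requests_per_proxy
        self._current = next(self._cycle)
        self._used = 0

    def process_request(self, request, spider):
        if self._used >= self.requests_per_proxy:
            # Budget for this proxy is spent: switch to the next one.
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        # Scrapy's HttpProxyMiddleware picks the proxy up from this meta key.
        request.meta["proxy"] = self._current
        return None  # let Scrapy continue processing the request
```

It would need a priority below 750 in `DOWNLOADER_MIDDLEWARES` so that it runs before the built-in proxy middleware. Many residential providers also offer a single gateway endpoint that rotates IPs for you, in which case a fixed `proxy` value may be enough.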

More information about it here
