
Blocked from scraping a website with Scrapy?

Original 2019-12-19 11:12:01 4 2 python/ web-scraping/ scrapy/ user-agent

I'm still trying to scrape search results from this kind of URL, which returns the search results of a Chinese online newspaper. Scrapy works for a few requests, and then I get the following terminal output:

2019-12-19 11:56:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <461 http://so.news.cn/getNews?keyword=%E7%BE%8E%E5%9B%BD&curPage=55&sortField=0&searchFields=0&lang=cn >: HTTP status code is not handled or not allowed

It seems to work better if I add a delay, but then it is very slow. Is this because I am being blocked by the site, and is there anything I can do about it? I don't currently have any special User-Agent defined in settings.py. I have tried using scrapy-UserAgent to rotate the User-Agent, but it doesn't seem to work. Would a VPN help?

Thanks

2 answers

Different solutions to test:

Add a random pause between requests.
Make good use of sessions:
1) Keep the same session for a number of requests (30 to 60).
2) After 30 to 60 requests, clear your cookies and change the user agent. This simple Python package can help: https://pypi.org/project/shadow-useragent/
3) If that still does not work: rotate your IP over time (every 30 to 60 requests, for instance) through a proxy provider, and rotate your user agent and clear your cookies at the same time.

You should now look like random, unrelated visitors to most websites. If you run into further bot mitigation (reCAPTCHAs) or specialized anti-scraping services, this could get trickier.
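The first two suggestions above can be sketched as a small Scrapy downloader middleware plus a few settings. This is only a sketch under assumptions: the class name, the `myproject.middlewares` path, the user-agent strings, and the priority 450 are placeholders that are not part of the answer. It relies on Scrapy's `cookiejar` request meta key, which makes the built-in cookies middleware keep a separate cookie jar per key, so switching the key has the same effect as starting with cleared cookies.

```python
import random

# Placeholder pool (assumption): replace with current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0",
]


class RotatingSessionMiddleware:
    """Hypothetical downloader middleware: keeps one "session" (User-Agent
    plus cookie jar) for a random number of requests, then starts a new one."""

    def __init__(self, user_agents=USER_AGENTS, low=30, high=60):
        self.user_agents = list(user_agents)
        self.low, self.high = low, high
        self.session_id = 0
        self._start_session()

    def _start_session(self):
        # A new session id means a fresh, empty cookie jar in Scrapy.
        self.session_id += 1
        self.user_agent = random.choice(self.user_agents)
        self.remaining = random.randint(self.low, self.high)

    def process_request(self, request, spider):
        if self.remaining == 0:
            self._start_session()
        self.remaining -= 1
        request.headers["User-Agent"] = self.user_agent
        request.meta["cookiejar"] = self.session_id
        return None  # let Scrapy continue processing the request


# --- settings.py (values are illustrative assumptions) ---
DOWNLOAD_DELAY = 2               # base pause between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # wait 0.5x to 1.5x of DOWNLOAD_DELAY
DOWNLOADER_MIDDLEWARES = {
    # Must run before the built-in cookies middleware (priority 700).
    "myproject.middlewares.RotatingSessionMiddleware": 450,
}
```

With a random delay the crawl stays slow, but less predictable; the session rotation is what changes how the requests look to the server.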

In addition to what was already said, I'd add that the right proxy service provider is crucial here.

Not only do you have to rotate proxies very often, but their success rates have to be high as well, so in your case I'd go with residential IPs, which closely resemble real users.

Not to promote any one in particular, but you should look into providers such as Luminati, Oxylabs, Geosurf, etc.
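As a rough illustration of rotating proxies while keeping an eye on their success rates, here is a sketch of a proxy pool and a downloader middleware. Everything named here is an assumption: the class names, the thresholds, and the set of status codes treated as "blocked" (461 is taken from the log in the question). It is not any provider's API. The only Scrapy behavior it relies on is that a request is sent through the proxy given in `request.meta["proxy"]`.

```python
import random
from collections import defaultdict


class ProxyPool:
    """Hypothetical pool that tracks per-proxy success rates."""

    def __init__(self, proxies, min_success_rate=0.5, min_samples=10):
        self.proxies = list(proxies)
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})

    def record(self, proxy, ok):
        self.stats[proxy]["ok" if ok else "fail"] += 1

    def success_rate(self, proxy):
        s = self.stats[proxy]
        total = s["ok"] + s["fail"]
        return s["ok"] / total if total else 1.0

    def healthy(self):
        # Keep proxies that are still untested or succeed often enough.
        good = [
            p for p in self.proxies
            if sum(self.stats[p].values()) < self.min_samples
            or self.success_rate(p) >= self.min_success_rate
        ]
        return good or self.proxies  # never return an empty list

    def pick(self):
        return random.choice(self.healthy())


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: a different proxy per request."""

    BLOCKED_STATUSES = {403, 429, 461}  # assumption; adjust per site

    def __init__(self, pool):
        self.pool = pool

    def process_request(self, request, spider):
        request.meta["proxy"] = self.pool.pick()
        return None

    def process_response(self, request, response, spider):
        proxy = request.meta.get("proxy")
        if proxy:
            ok = response.status not in self.BLOCKED_STATUSES
            self.pool.record(proxy, ok)
        return response
```

In a real project the pool would be built from the provider's proxy URLs (for example in a `from_crawler` class method), and the middleware would be registered in `DOWNLOADER_MIDDLEWARES` with a priority below 750 so that it runs before Scrapy's built-in proxy middleware.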

More information about it here


The technical posts on this site follow the CC BY-SA 4.0 license. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
