
Why does Scrapy get a 403 response from Udemy?

I was trying to use scrapy shell to inspect the result of response.css on a page. The snippet I used is response.css("title::text").extract(), which normally returns the title of the webpage. It does not seem to work for Udemy, although the same call works fine on Amazon. Any comments?

scrapy shell "https://www.udemy.com/courses/search/?q=python&src=sac&kw=python"
response.css("title::text").extract()
['Access to this page has been denied.']

On the other hand, the one below works fine:

scrapy shell "https://www.amazon.com/s?k=garlic+press&crid=2DY5U90PELGKN&sprefix=garlic+pres%2Caps%2C286&ref=nb_sb_ss_i_1_11"

response.css("title::text").extract()
['Amazon.com: garlic press']

EDIT:

scrapy shell --set=USER_AGENT='Mozilla/5.0' "https://www.udemy.com/courses/search/?q=python&src=sac&kw=python"
response.css("h4::text").extract()
[]

Udemy is trying to prevent automated scraping. It returns an HTTP 403 response, and the body of that response contains text stating:

Access to this page has been denied because we believe you are using automation tools to browse the website.
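To confirm the block from code rather than by reading the title by eye, you can look at the status code and the body together. This is only a minimal sketch: the helper name looks_blocked and the constant BLOCK_MARKER are made up for illustration, and the marker string is simply the message quoted above.

```python
# Marker text taken from the block page quoted above.
BLOCK_MARKER = "Access to this page has been denied"


def looks_blocked(status: int, body: str) -> bool:
    """Return True if a response looks like the bot-protection page.

    Hypothetical helper: it treats an HTTP 403, or a body containing
    the quoted marker text, as a sign that the request was refused.
    """
    return status == 403 or BLOCK_MARKER.lower() in body.lower()
```

Inside scrapy shell the matching inputs would be response.status and response.text, i.e. looks_blocked(response.status, response.text).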

They block requests whose User-Agent HTTP header does not look like a client they want to serve, and by default Scrapy identifies itself as Scrapy in that header. Luckily, headers can be spoofed:

scrapy shell --set=USER_AGENT='Mozilla/5.0' "https://www.udemy.com/courses/search/?q=python&src=sac&kw=python"

This ought to work, though I don't have Python or Scrapy on this machine, so I haven't tested it.
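The same idea can be tried outside Scrapy with nothing but the standard library. This is a sketch, not something verified against Udemy: the function names build_request and fetch_status are invented for illustration, the 10-second timeout is an arbitrary choice, and the site may apply checks beyond the User-Agent header (the question's EDIT shows an empty h4 result even with the header set).

```python
import urllib.error
import urllib.request

URL = "https://www.udemy.com/courses/search/?q=python&src=sac&kw=python"


def build_request(url: str, user_agent: str = "Mozilla/5.0") -> urllib.request.Request:
    """Build a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is on the exception.
        return err.code


if __name__ == "__main__":
    # A 403 here would mean the spoofed header alone was not enough.
    print(fetch_status(build_request(URL)))
```

If the status is still 403, the User-Agent value is probably not the only thing being checked.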

Edit: I'm not certain about the legality of circumventing their bot protection. Make sure to check your local laws before you use this advice.
