
Website scraping, robot identification

Are there websites that can identify that it is a script accessing them, in spite of the User-Agent header being changed (which I assume works like the code below), and return an error?

import urllib2

url = 'http://example.com/'  # placeholder target URL
req_headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=req_headers)
html = urllib2.urlopen(req).read()
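As an aside, in Python 3 the urllib and urllib2 modules were merged into urllib.request; the URL below is a placeholder, but the equivalent request looks like this:

```python
import urllib.request

url = "http://example.com/"  # placeholder URL for illustration

# Same idea as the urllib2 snippet above: attach a custom User-Agent
# header to the request before opening it.
req_headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, headers=req_headers)

# urllib.request.urlopen(req) would then fetch the page; here we just
# confirm the header was set (Request capitalizes header names).
print(req.get_header("User-agent"))  # → Mozilla/5.0
```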

If yes, then how?

First of all, your User-Agent string is pretty incomplete and easily detectable as fake.

I describe some robot-detection techniques in my answer to "Hunting cheaters in a voting competition".

Yes. For starters, look at the complete set of headers your browser sends using a tool like Firebug. You'll notice that normal browsers provide a lot of information, such as the languages they accept, that urllib2 does not. So a website might check for the presence of those other headers.
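To blend in, a script can send the same extra headers a real browser would. A minimal sketch using Python 3's urllib.request; the header values shown are typical examples copied from a browser, not required values:

```python
import urllib.request

url = "http://example.com/"  # placeholder URL for illustration

# Headers a typical browser sends alongside User-Agent. Only advertise
# gzip in Accept-Encoding if your script actually decompresses gzip.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}

req = urllib.request.Request(url, headers=browser_headers)

# A naive server-side check would flag clients that omit headers such
# as Accept-Language; this request would pass that check.
print(req.has_header("Accept-language"))  # → True
```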

Another trick would be to include a 1x1-pixel image on a page and check whether the client requested the image file. If not, the client is either using a text-only browser (like lynx) or is actually a script. I think JavaScript can also be used to look for the presence of a mouse.
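The pixel trick above boils down to a server-side log comparison: flag clients that fetched the page but never the embedded image. A small sketch; the log format and paths here are invented for illustration:

```python
PIXEL_PATH = "/pixel.gif"  # embedded as <img src="/pixel.gif" width="1" height="1">

def suspected_bots(log_entries, page_path="/index.html"):
    """Return client IDs that fetched the page but never the tracking pixel.

    log_entries: iterable of (client_id, requested_path) tuples.
    """
    page_clients = {cid for cid, path in log_entries if path == page_path}
    pixel_clients = {cid for cid, path in log_entries if path == PIXEL_PATH}
    return page_clients - pixel_clients

log = [
    ("10.0.0.1", "/index.html"),  # normal browser loads the page...
    ("10.0.0.1", "/pixel.gif"),   # ...and the embedded image
    ("10.0.0.2", "/index.html"),  # script fetches the HTML only
]
print(suspected_bots(log))  # → {'10.0.0.2'}
```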

Generally, it's a game of cat and mouse. One alternative to urllib2 is Selenium, which launches and drives a real browser window, so requests carry full browser headers, JavaScript executes, and images are loaded.

