
Website scraping, robot identification

Are there websites that can identify that it is a script accessing them, in spite of the User-Agent header being changed (which I assume works like the code below), and return an error?

import urllib2

url = 'http://example.com/'  # placeholder target URL
req_headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url, headers=req_headers)
html = urllib2.urlopen(req).read()
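As an aside, in Python 3 the urllib and urllib2 modules were merged into urllib.request; the URL below is a placeholder, but the equivalent request looks like this:

```python
import urllib.request

url = "http://example.com/"  # placeholder URL for illustration

# Same idea as the urllib2 snippet above: attach a custom User-Agent
# header to the request before opening it.
req_headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, headers=req_headers)

# urllib.request.urlopen(req) would then fetch the page; here we just
# confirm the header was set (Request capitalizes header names).
print(req.get_header("User-agent"))  # → Mozilla/5.0
```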

If yes, then how?

First of all, your User-Agent string is pretty incomplete and easily detectable as fake.

I describe some robot-detection techniques in my answer to "Hunting cheaters in a voting competition".

Yes. For starters, look at the complete set of headers your browser sends using a tool like Firebug. You'll notice that normal browsers provide a lot of information, such as the languages they accept, that urllib2 does not. So a website might check for the presence of those other headers.
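To blend in, a script can send the same extra headers a real browser would. A minimal sketch using Python 3's urllib.request; the header values shown are typical examples copied from a browser, not required values:

```python
import urllib.request

url = "http://example.com/"  # placeholder URL for illustration

# Headers a typical browser sends alongside User-Agent. Only advertise
# gzip in Accept-Encoding if your script actually decompresses gzip.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
}

req = urllib.request.Request(url, headers=browser_headers)

# A naive server-side check would flag clients that omit headers such
# as Accept-Language; this request would pass that check.
print(req.has_header("Accept-language"))  # → True
```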

Another trick would be to include a 1x1-pixel image on a page and check whether the client requested the image file. If not, the client is either using a text-only browser (like lynx) or is actually a script. I think JavaScript can also be used to look for the presence of a mouse.
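The pixel trick above boils down to a server-side log comparison: flag clients that fetched the page but never the embedded image. A small sketch; the log format and paths here are invented for illustration:

```python
PIXEL_PATH = "/pixel.gif"  # embedded as <img src="/pixel.gif" width="1" height="1">

def suspected_bots(log_entries, page_path="/index.html"):
    """Return client IDs that fetched the page but never the tracking pixel.

    log_entries: iterable of (client_id, requested_path) tuples.
    """
    page_clients = {cid for cid, path in log_entries if path == page_path}
    pixel_clients = {cid for cid, path in log_entries if path == PIXEL_PATH}
    return page_clients - pixel_clients

log = [
    ("10.0.0.1", "/index.html"),  # normal browser loads the page...
    ("10.0.0.1", "/pixel.gif"),   # ...and the embedded image
    ("10.0.0.2", "/index.html"),  # script fetches the HTML only
]
print(suspected_bots(log))  # → {'10.0.0.2'}
```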

Generally, it's a game of cat and mouse. One alternative to urllib2 is Selenium, which launches and drives a real browser window, so requests carry full browser headers, JavaScript executes, and images are loaded.

