
Link Checker (Spider Crawler)

I am looking for a link checker to spider my website and log invalid links. The problem is that the site starts with a login page, which is required. What I want is a link checker that can be given the login details on the command line, POST them, and then spider the rest of the website.

Any ideas will be appreciated.

I recently solved a similar problem; here's how:

import urllib
import urllib2
import cookielib

login = 'user@host.com'
password = 'secret'

# the CookieJar stores the session cookie the site sets after login
cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

# adjust this to match the form's field names
values = {'username': login, 'password': password}
data = urllib.urlencode(values)

# a Request with a data argument is sent as a POST
request = urllib2.Request('http://target.of.POST-method', data)
response = urlOpener.open(request)

# from now on, we're authenticated and can access the rest of the site
response = urlOpener.open('http://rest.of.user.area')
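The snippet above is Python 2. On Python 3 the same modules live under different names (http.cookiejar, urllib.request, urllib.parse), and the form data must be encoded to bytes. A sketch of the equivalent setup, keeping the same placeholder URL and field names, which you would adjust to your site:

```python
import http.cookiejar
import urllib.parse
import urllib.request

login = 'user@host.com'
password = 'secret'

# the cookie jar keeps the session cookie across requests
cookiejar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookiejar))

# adjust the field names to match the login form
values = {'username': login, 'password': password}
data = urllib.parse.urlencode(values).encode('ascii')

# a Request with a data argument is sent as a POST
request = urllib.request.Request('http://target.of.POST-method', data)
# opener.open(request) would perform the login;
# subsequent opener.open(...) calls reuse the session cookie
```

Note that in Python 3 the POST body must be bytes, hence the .encode('ascii') after urlencode.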

You want to look at the cookielib module: http://docs.python.org/library/cookielib.html . It provides a full implementation of HTTP cookies, which lets you hold on to the session after logging in. Once you're using a CookieJar, you just have to get the login details from the user (say, from the console) and submit the proper POST request.
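Once the opener is authenticated, the spidering part reduces to fetching each page with it, extracting the links, and logging any URL that fails to open. The link extraction can be done with the standard library's HTML parser; a minimal Python 3 sketch (the class and function names here are my own, not from any library):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return all href values found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

In the crawl loop you would resolve each extracted href against the page's URL with urllib.parse.urljoin, try opener.open() on it, and catch urllib.error.HTTPError / urllib.error.URLError to log the link as broken.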

