
How to design a crawl bot?

I'm working on a little project to analyze the content on some sites I find interesting; this is a real DIY project that I'm doing for my entertainment/enlightenment, so I'd like to code as much of it on my own as possible.

Obviously, I'm going to need data to feed my application, and I was thinking I would write a little crawler that would take maybe 20k pages of HTML and write them to text files on my hard drive. However, when I looked on SO and other sites, I couldn't find any information on how to do this. Is it feasible? It seems like there are open-source options available (WebSPHINX?), but I would like to write this myself if possible.

Scheme is the only language I know well, but I thought I'd use this project to teach myself some Java, so I'd be interested if there are any Racket or Java libraries that would be helpful for this.

So I guess, to summarize my question: what are some good resources to get started on this? How can I get my crawler to request info from other servers? Will I have to write a simple parser for this, or is that unnecessary given that I want to take the whole HTML file and save it as txt?

This is entirely feasible, and you can definitely do it with Racket. You may want to take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:

http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil

... is probably the place to start. You should be able to pull the content of a web page into a parsed format in one or two lines of code.
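
For instance, a minimal sketch combining it with net/url might look like the following (I'm assuming the versionless PLaneT require form and the html->sxml export here; the PLaneT page above shows the exact require spec for the current version):

(require net/url
         (planet neil/htmlprag)) ; check the PLaneT page for the exact require spec

;; Fetch a page over HTTP and parse it into an SXML tree.
;; html->sxml accepts an input port (or a string), so the two steps compose directly.
(define (fetch-and-parse url-string)
  (html->sxml (get-pure-port (string->url url-string))))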

Let me know if you have any questions about this.

Having done this myself in Racket, here is what I would suggest.

Start with a "Unix tools" approach:

  • Use curl to do the work of downloading each page (you can execute it from Racket using system) and storing the output in a temporary file.
  • Use Racket to extract the URIs from the <a> tags.
    • You can "cheat" and do a regular expression string search.
    • Or, do it "the right way" with a true HTML parser, as John Clements' great answer explains.
    • Consider maybe doing the cheat first, then looping back later to do it the right way.
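
For the "cheat" option, the sketch can be as small as this (a rough regexp that only catches double-quoted href attributes; good enough to bootstrap, and exactly the kind of thing you later replace with a real parser):

;; Rough "cheat": pull double-quoted href values out of an HTML string.
;; Misses single-quoted/unquoted attributes and will match some junk.
(define (extract-links html)
  (regexp-match* #px"href=\"([^\"]*)\"" html #:match-select cadr))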

At this point you could stop, or, you could go back and replace curl with your own code to do the downloads. For this you can use Racket's net/url module.
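
A bare-bones replacement might look something like this (a sketch only: no redirect handling, no cookies, and no timeouts, which is exactly what curl is buying you below):

(require net/url racket/port)

;; Download url-string and write the raw response body to out-file.
;; Note what is missing: redirects, cookies, timeouts, retries.
(define (save-page url-string out-file)
  (define in (get-pure-port (string->url url-string)))
  (call-with-output-file out-file
    (lambda (out) (copy-port in out))
    #:exists 'replace)
  (close-input-port in))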

The reason I suggest trying curl first is that it handles things that are more complicated than they might seem:

  • Do you want to follow 30x redirects?
  • Do you want to accept/store/provide cookies (the site may behave differently otherwise)?
  • Do you want to use HTTP keep-alive?
  • And on and on.

For example, using curl like this:

(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff " ))

(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                   curl-core-options
                   (path->string out-file)
                   url)))

(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

represents a lot of code that you would otherwise need to write from scratch in Racket, to do all the things those curl command-line flags are doing for you.
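
Calling it from Racket then looks something like this (the URL and output path here are just placeholders):

(curl/get "http://example.com/some-page.html"
          (build-path 'same "tmp" "page-0001.html"))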

In short: Start with the simplest case of using existing tools. Use Racket almost as a shell script. If that's good enough for you, stop. Otherwise go on to replace the tools one by one with your bespoke code.

I suggest looking into the open-source web crawler for Java known as crawler4j.

It is very simple to use and it provides very good resources and options for your crawling.

If you know Scheme, and you want to ease into Java, why don't you start with Clojure?

You can leverage your Lisp knowledge, and take advantage of the Java HTML parsing libraries* out there to get something working. Then, if you want to start transitioning parts of it to Java to learn a bit, you can write bits of functionality in Java and wire that into the Clojure code.

Good luck!

* I've seen several SO questions on this.

If I were you, I wouldn't write a crawler; I'd use one of the many free tools that download web sites locally for offline browsing (e.g. http://www.httrack.com/) to do the spidering. You may need to tweak the options to disable downloading images, etc., but those tools are going to be way more robust and configurable than anything you write yourself.

Once you do that, you'll have a whole ton of HTML files locally that you can feed to your application.

I've done a lot of textual analysis of HTML files; as a Java guy, my library of choice for distilling HTML into text (again, not something you want to roll yourself) is the excellent Jericho parser: http://jericho.htmlparser.net/docs/index.html

EDIT: re-reading your question, it does appear that you are set on writing your own crawler; if so, I would recommend Commons HttpClient to do the downloading, and still Jericho to pull out the links and process them into new requests.

I did that in Perl years ago (much easier, even without the webcrawler module).

I suggest you read the wget documentation and use the tool for inspiration. Wget is the netcat of webcrawling; its feature set will inspire you.

Your program should accept a list of URLs to start with and add them to a list of URLs to try. You then have to decide if you want to collect every URL or only add those from the domains (and subdomains?) provided in the initial list.

I made you a fairly robust starting point in Scheme:

(define (crawl . urls)
  ;; I would use regular expressions for this unless you have a special module for it
  ;; Hint: URLs tend to hide in comments, referral tags, cookies... not just links.
  (define (parse url) ...)
  ;; For this I would convert URL strings to a standard form, then string=
  (define (url= x y) ...)
  ;; use whatever DNS lookup mechanism your implementation provides
  (define (get-dom url) ...)
  ;; the rest should work fine on its own unless you need to modify anything
  ;; (receive is SRFI 8, car+cdr is SRFI 1)
  (if (null? urls)
      (error "No URLs!")
      (let ([doms (map get-dom urls)])
        (let loop ([urls urls] [done '()])
          (unless (null? urls)
            (receive (url rest) (car+cdr urls)
              (if (or (member url done url=)
                      (not (member (get-dom url) doms url=)))
                  (loop rest done)
                  (begin (parse url) (display url) (newline)
                         (loop rest (cons url done))))))))))
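
Once the three helpers are filled in, you would kick it off with something like (crawl "http://example.com/"), and it prints each URL it visits. Note that as written the loop only walks the URLs you pass in; feeding the URLs that parse discovers back into the work list is the natural next step.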
