
How do I store crawled data into a database?

I'm fairly new to Python and everything else I'm about to talk about in this question, but I want to get started with a project I've been thinking about for some time now. Basically, I want to crawl the web and display the URLs on a web page in real time, as they are crawled. I coded a simple crawler which stores the URLs in a list. I was wondering how to get this list into a database and have the database updated every x seconds, so that I can access the database and output the list of links on the web page periodically.

I don't know much about real-time web development, but that's a topic for another day. Right now, I'm more concerned with how to get the list into the database. I'm currently using the web2py framework, which is quite easy to get along with, but if you have any recommendations as to where I should look or what frameworks I should check out, please mention that in your answers too. Thanks.

In a nutshell, the things I'm a noob at are: Python, databases, real-time web dev.

Here's the code for my crawler, in case it helps in any way. Thanks!

from urllib2 import urlopen
from urlparse import urljoin

def crawler(url, x):
    crawled = []   # pages that have already been fetched
    tocrawl = []   # pages queued up for fetching

    def crawl(url, x):
        x = x + 1  # x just tracks how deep the recursion has gone
        try:
            page = urlopen(url).read()
        except Exception:
            return  # skip pages that cannot be fetched and keep crawling
        findlink = page.find('<a href=')
        while findlink != -1:
            start = page.find('"', findlink)
            end = page.find('"', start + 1)
            link = page[start + 1:end]
            if link and link != url:
                if link.startswith('/'):
                    # resolve relative links against the current page
                    link = urljoin(url, link)
                if link not in tocrawl and link not in crawled:
                    tocrawl.append(link)
            findlink = page.find('<a href=', end)
        crawled.append(url)
        while tocrawl:
            crawl(tocrawl.pop(0), x)

    crawl(url, x)
    return crawled  # the list of visited urls, ready to be stored somewhere

Instead of placing the URLs into a list, why not write them to the database directly? For example, using MySQL:

import MySQLdb

conn = MySQLdb.connect('server', 'user', 'pass', 'db')
curs = conn.cursor()
# a parameterized query avoids quoting problems and SQL injection;
# assumes your_table has an auto-increment id and a url column
curs.execute('INSERT INTO your_table (url) VALUES (%s)', (link,))
conn.commit()  # make the insert visible to readers on other connections
conn.close()

This way you don't have to manage the list like a pipe between the crawler and the page. But if keeping the list is necessary, this can also be adapted for that approach.
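
For the other half of the question (reading the links back to show on the page every x seconds), here is a minimal sketch of the read side, assuming the same hypothetical your_table with an auto-increment id column and a url column:

import MySQLdb

def load_links():
    # called periodically, e.g. from a web2py controller, to render the latest links
    conn = MySQLdb.connect('server', 'user', 'pass', 'db')
    curs = conn.cursor()
    curs.execute('SELECT url FROM your_table ORDER BY id DESC LIMIT 100')
    links = [row[0] for row in curs.fetchall()]
    conn.close()
    return links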

This sounds like a good job for Redis, which has a built-in list structure. To append a new URL to your list, it's as simple as:

from redis import Redis
red = Redis()

# Later in your code...
red.lpush('crawler:tocrawl', link)

It also has a set type that lets you efficiently check which websites you've crawled and lets you sync multiple crawlers.

# Check if we're the first one to mark this link
if red.sadd('crawler:crawled', link):
    red.lpush('crawler:tocrawl', link)

To get the next link to crawl:

url = red.lpop('crawler:tocrawl')

To see which urls are queued to be crawled:

print red.lrange('crawler:tocrawl', 0, -1)

It's just one option, but it is very fast and flexible. You can find more documentation on the Redis Python driver page.
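
Putting those snippets together, a minimal worker loop might look like the sketch below. It reuses the crawler:tocrawl and crawler:crawled keys from above; extract_links is a hypothetical stand-in for the link extraction the question's crawl function already does:

from redis import Redis
from urllib2 import urlopen

red = Redis()

def worker(seed):
    # seed the queue the same way as any other link
    if red.sadd('crawler:crawled', seed):
        red.lpush('crawler:tocrawl', seed)

    while True:
        url = red.lpop('crawler:tocrawl')
        if url is None:
            break  # queue is empty, nothing left to crawl
        try:
            page = urlopen(url).read()
        except Exception:
            continue  # keep crawling even if one page fails
        for link in extract_links(page, url):  # hypothetical helper; reuse crawl()'s parsing
            # sadd returns 1 only the first time a member is added,
            # so each link enters the queue at most once
            if red.sadd('crawler:crawled', link):
                red.lpush('crawler:tocrawl', link)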

To get the database updated every x seconds, you can use cron, the job scheduler on Unix-like systems. You can schedule a cron job to run every minute, every hour, every day, and so on.
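
For instance, a small script run by cron every minute could invoke the crawler and write what it collected into the database. Everything below is a placeholder sketch: it assumes the question's code is saved as crawler.py, that crawler() returns its list of crawled URLs, and that a your_table like the one in the MySQL answer exists:

# run_crawler.py -- hypothetical script for cron to run
# Example crontab entry (run every minute):
#   * * * * * /usr/bin/python /path/to/run_crawler.py
import MySQLdb
from crawler import crawler  # assumes the question's code lives in crawler.py

if __name__ == '__main__':
    links = crawler('http://example.com', 0)
    conn = MySQLdb.connect('server', 'user', 'pass', 'db')
    curs = conn.cursor()
    for link in links:
        curs.execute('INSERT INTO your_table (url) VALUES (%s)', (link,))
    conn.commit()
    conn.close()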

Also check out this tutorial, http://newcoder.io/scrape/intro/ . It will help you achieve what you want here.

Thanks, I'll let you know if it works.
