
How to crawl a website/extract data into database with python?

I'd like to build a web app to help other students at my university create their schedules. To do that I need to crawl the master schedule (one huge HTML page), follow the link to the detailed description of each course, and store everything in a database, preferably using Python. I also need to log in to access the data.

  • How would that work?
  • What tools/libraries can/should I use?
  • Are there good tutorials on that?
  • How do I best deal with binary data (e.g. a pretty PDF)?
  • Are there already good solutions for that?

If you want to use a powerful scraping framework, there's Scrapy. It has good documentation, too. It may be a little overkill depending on your task, though.

Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.
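To make "maintaining state for an authenticated session" concrete, here is a minimal sketch of the underlying idea using only the standard library rather than Scrapy: a cookie jar shared by every request, so the session cookie set at login is sent back automatically afterwards. The URLs and the form field names `username` and `password` are assumptions for illustration; the real login form may use different names or need extra hidden fields.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores cookies and re-sends them on later requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build the POST request that submits the login form.

    The field names 'username' and 'password' are assumptions; inspect the
    real form to find the names it actually expects.
    """
    data = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=data.encode("utf-8"))


def fetch_schedule(opener, login_url, schedule_url, username, password):
    """Log in, then download the protected page with the same session."""
    # Logging in stores the session cookie in the opener's cookie jar ...
    opener.open(login_request(login_url, username, password)).close()
    # ... and the opener sends it back automatically on the next request.
    with opener.open(schedule_url) as response:
        return response.read()
```

A call would look like `fetch_schedule(opener, "https://example.edu/login", "https://example.edu/schedule", "alice", "secret")`, with both URLs hypothetical. Scrapy handles the cookie bookkeeping for you in a similar way once a login request has succeeded.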

Binary data should be handled separately. Each file type needs its own handling logic, and for almost any format you can probably find a library. For instance, take a look at PyPDF for handling PDFs; for Excel files you can try xlrd.
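One way to organize "handle each file type differently" is a small dispatch table keyed on the file type, detected from the first bytes of the download. This is only a sketch: the handler functions are hypothetical placeholders that return the payload size, and a real version would call a PDF or Excel library inside them.

```python
def sniff_type(data: bytes) -> str:
    """Guess the file type from its leading magic bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"  # OLE2 compound file, the container used by legacy .xls
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # ZIP container, also used by .xlsx and .docx
    return "unknown"


def handle_pdf(data: bytes) -> dict:
    # Placeholder: a PDF library would extract text or metadata here.
    return {"bytes": len(data)}


def handle_xls(data: bytes) -> dict:
    # Placeholder: an Excel library would read the sheets here.
    return {"bytes": len(data)}


# Map each detected type to the function that knows how to process it.
HANDLERS = {"pdf": handle_pdf, "xls": handle_xls}


def process(data: bytes):
    """Return (detected type, handler result), or (type, None) if unsupported."""
    kind = sniff_type(data)
    handler = HANDLERS.get(kind)
    if handler is None:
        return kind, None
    return kind, handler(data)
```

Checking the content itself rather than the file extension or URL helps when the server gives downloads uninformative names.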

I liked using BeautifulSoup for extracting HTML data.

It's as easy as this:

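# Python 3 with BeautifulSoup 4: pip install beautifulsoup4 lxml
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the RSS feed and parse it as XML (the "xml" parser needs lxml)
with urlopen("http://pragprog.com/podcasts/feed.rss") as response:
    soup = BeautifulSoup(response.read(), "xml")

# Each <item> is one episode; its <enclosure> tag holds the media URL
items = soup.find_all("item")
urls = [item.enclosure["url"] for item in items if item.enclosure]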
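Since the question asks about getting the results into a database, here is a minimal sketch of storing a list like `urls` in SQLite with the standard `sqlite3` module. The table name `enclosure`, the helper `save_urls`, and the sample URLs are assumptions for illustration, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3


def save_urls(conn, urls):
    """Insert URLs into a table, silently skipping ones already stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS enclosure (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO enclosure (url) VALUES (?)",
        [(url,) for url in urls],
    )
    conn.commit()


# ":memory:" keeps the database in RAM; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
save_urls(conn, [
    "http://example.com/a.mp3",
    "http://example.com/b.mp3",
    "http://example.com/a.mp3",  # duplicate, ignored by the PRIMARY KEY
])
print(conn.execute("SELECT COUNT(*) FROM enclosure").fetchone()[0])  # prints 2
```

Making the URL the primary key means the crawler can be re-run without creating duplicate rows.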

There is a very useful tool for this purpose called Web-Harvest. Here is a link to their website: http://web-harvest.sourceforge.net/ I use it to scrape web pages.

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 