How do I crawl a website / extract data into a database with Python?

Question

I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedule (one huge HTML page), as well as the linked detailed description of each course, into a database, preferably in Python. Also, I need to log in to access the data.

How would that work?
What tools/libraries can/should I use?
Are there good tutorials on that?
How do I best deal with binary data (e.g. pretty PDFs)?
Are there already good solutions for that?

Answer 1

requests for downloading the pages.
- Here's an example of how to log in to a website and download pages: https://stackoverflow.com/a/8316989/311220
lxml for scraping the data.
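
As a rough sketch of how the two libraries fit together: a requests.Session keeps the login cookies between calls, and lxml turns the downloaded page into a tree you can query with XPath. The URLs, the form field names (username, password) and the table id "schedule" are made up for illustration; the real ones have to be read from the university's login form and schedule page.

```python
import requests
from lxml import html

# Hypothetical URLs; replace them with the real login and schedule pages.
LOGIN_URL = "https://example.edu/login"
SCHEDULE_URL = "https://example.edu/schedule"


def fetch_schedule_page(username, password):
    """Log in, then download the schedule page with the same session."""
    with requests.Session() as session:
        # The session stores the login cookies and sends them on later requests.
        # The form field names below are assumptions, not the real site's names.
        response = session.post(
            LOGIN_URL,
            data={"username": username, "password": password},
            timeout=30,
        )
        response.raise_for_status()
        page = session.get(SCHEDULE_URL, timeout=30)
        page.raise_for_status()
        return page.text


def extract_courses(page_source, base_url=SCHEDULE_URL):
    """Return (title, absolute detail URL) for each course link in the table."""
    tree = html.fromstring(page_source)
    # Turn relative hrefs such as "/courses/cs101" into absolute URLs.
    tree.make_links_absolute(base_url)
    courses = []
    # Assumes every course is a link inside a table with id "schedule".
    for link in tree.xpath('//table[@id="schedule"]//a[@href]'):
        courses.append((link.text_content().strip(), link.get("href")))
    return courses
```

Calling extract_courses(fetch_schedule_page(user, password)) would then give a list of course titles with the links to their detail pages, which can be downloaded with the same session approach.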

If you want to use a powerful scraping framework, there's Scrapy. It has good documentation too. It may be a little overkill depending on your task, though.

Answer 2

Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.

Binary data should be handled separately. Each file type has to be handled differently according to your own logic. For almost any format, you'll probably be able to find a library. For instance, take a look at PyPDF for handling PDFs. For Excel files you can try xlrd.
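
One way to organise the "handle each file type differently" idea is a small table that maps file extensions to handler functions. This is only a sketch: the function names and the HANDLERS table are hypothetical, and the PDF handler assumes the pypdf package (the current successor of PyPDF) is installed.

```python
import io
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pdf_to_text(data):
    """Extract text from PDF bytes; assumes the pypdf package is installed."""
    from pypdf import PdfReader  # imported lazily so other handlers work without it

    reader = PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def keep_raw(data):
    """Fallback: return the bytes untouched for formats without a handler."""
    return data


# Maps a lowercase file extension to the function that processes that format.
HANDLERS = {".pdf": pdf_to_text}


def handle_download(url, data):
    """Pick a handler from the URL's file extension and apply it to the bytes."""
    extension = PurePosixPath(urlparse(url).path).suffix.lower()
    handler = HANDLERS.get(extension, keep_raw)
    return handler(data)
```

Support for another format, such as Excel files via xlrd, would be added by registering one more function in HANDLERS.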

Answer 3

I liked using BeautifulSoup for extracting HTML data.

It's as easy as this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the RSS feed and parse it
ur = urlopen("http://pragprog.com/podcasts/feed.rss")
soup = BeautifulSoup(ur.read(), "html.parser")
items = soup.find_all('item')

# Collect the enclosure URL of every item
urls = [item.enclosure['url'] for item in items]
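
The question also asks how to get the extracted values into a database. A minimal sketch with the standard library's sqlite3 module could look like the following; the table name courses, its two columns and the sample row are assumptions chosen for illustration, not anything taken from the real schedule.

```python
import sqlite3


def save_courses(connection, courses):
    """Insert (title, url) pairs, replacing rows whose url already exists."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS courses (url TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    # Parameter placeholders (?) let sqlite3 escape the scraped values safely.
    connection.executemany(
        "INSERT OR REPLACE INTO courses (title, url) VALUES (?, ?)", courses
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
save_courses(connection, [("CS 101", "https://example.edu/courses/cs101")])
rows = connection.execute("SELECT title, url FROM courses").fetchall()
```

Using the URL as the primary key means the crawler can be re-run without creating duplicate rows.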

Answer 4

There is a very useful tool for this purpose called web-harvest. Here is a link to their website: http://web-harvest.sourceforge.net/. I use it to crawl web pages.

Answer 1: score 11, accepted, 2011-12-01 01:55:49
Answer 2: score 3, 2011-12-01 02:00:33
Answer 3: score 2, 2011-12-01 02:02:26
Answer 4: score 0, 2014-09-21 07:57:18