Creating a simple Python crawler with Scrapy

Question

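I am currently trying to create a simple crawler in Python using Scrapy. What I want it to do is read a list of links and save the HTML of the websites they link to. Right now I can get all the URLs, but I can't figure out how to download the pages. Here is the code for my spider so far: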

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BookItem

# Book Scrapy spider

class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://www.learnpythonthehardway.org/book/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        file = open(filename,'wb')
        file.write(response.body)
        file.close()

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items

Answer 1

In your parse method, include Request objects in the list of returned items to trigger the downloads:

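from scrapy.http import Request  # at the top of the spider module
for site in sites:
    ...
    items.append(item)
    # extract() returns a list, so schedule one Request per URL;
    # callback is an argument of Request, not of append()
    for link in item['link']:
        items.append(Request(link, callback=self.parse))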

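This will cause the spider to emit a BookItem for each link, and also to recurse and download the page of each book. Of course, if you want to parse the subpages differently, you can specify a different callback (e.g. self.parsebook).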
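
Two details can get in the way when following the links. If the extracted href values are relative (for example "ex1.html"), they have to be resolved against the URL of the page they came from before being passed to Request. The file name in the question, response.url.split("/")[-2], also evaluates to "book" both for the index page and for a URL such as ".../book/ex1.html", so every downloaded page would overwrite the same file. The following is a minimal sketch using only the Python 3 standard library. The helper names absolute_links and filename_for, the naming scheme, and the sample hrefs are assumptions made for illustration; they are not part of Scrapy or of the original code.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve each extracted href against the URL of the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


def filename_for(url):
    """Derive a local file name from a URL (hypothetical naming scheme)."""
    path = urlparse(url).path
    name = posixpath.basename(path)
    if not name:
        # A path ending in "/" has no basename: fall back to the last
        # directory name, or to "index" for the site root.
        name = (posixpath.basename(path.rstrip("/")) or "index") + ".html"
    return name


# Sample data (assumed): the start URL from the question and two made-up hrefs.
base = "http://www.learnpythonthehardway.org/book/"
links = absolute_links(base, ["ex1.html", "/book/ex2.html"])
names = [filename_for(url) for url in [base] + links]
```

Inside the spider, absolute_links(response.url, item['link']) would supply the URLs for the Request objects, and filename_for(response.url) could replace the split("/")[-2] expression when saving response.body.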

Solution 1
1 Accepted 2012-08-28 06:22:34
