Using Scrapy to find and download PDF files from a website

Question

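I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders. I found and modified this code: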

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

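I run this code at the command line with

scrapy crawl mySpider

and I get nothing. I didn't create a Scrapy item because I only want to crawl and download the files, without any metadata. I would appreciate any help.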

Answer 1

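The spider logic seems incorrect.

I had a quick look at your website, and there seem to be several types of pages:

1. The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
2. Pages for individual articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html , which can be reached from page #1
3. The actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf , which can be reached from page #2

So the correct logic looks like this: fetch page #1 first, then fetch the #2 pages, and from those download the #3 files.
Your spider, however, tries to extract links to the #3 files directly from page #1.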
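
To make the three-step flow concrete without touching the network, here is a minimal sketch in plain Python. The `FAKE_SITE` dictionary, the example.com URLs, and the helper names `links_on` and `find_pdfs` are all made up for illustration; they are not part of Scrapy or of the real site.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: each page URL maps to the hrefs found on it.
FAKE_SITE = {
    "http://example.com/index.html": ["articles/a.html", "articles/b.html"],
    "http://example.com/articles/a.html": ["/assets/a.pdf", "/about.html"],
    "http://example.com/articles/b.html": ["/assets/b.pdf"],
}


def links_on(page_url):
    """Return absolute URLs for every link on a page of the fake site."""
    return [urljoin(page_url, href) for href in FAKE_SITE.get(page_url, [])]


def find_pdfs(listing_url):
    """Walk listing page (#1) -> article pages (#2) -> PDF URLs (#3)."""
    pdfs = []
    for article_url in links_on(listing_url):
        for link in links_on(article_url):
            if link.endswith(".pdf"):
                pdfs.append(link)
    return pdfs


index = "http://example.com/index.html"

# Looking for PDFs directly on the listing page finds nothing ...
direct_hits = [link for link in links_on(index) if link.endswith(".pdf")]

# ... while going through the article pages finds them.
pdf_urls = find_pdfs(index)
```

In this toy setup `direct_hits` is empty while `pdf_urls` contains both PDF links, which matches the "I get nothing" symptom when the intermediate step is skipped.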

EDITED:

I have updated your code, and here is something that actually works:

import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
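
One small caveat about `save_pdf`: `response.url.split('/')[-1]` keeps any query string or fragment in the file name. If that ever becomes a problem, something along these lines could be used instead. This is only a sketch with the standard library; `pdf_filename` and its `default` parameter are hypothetical names, not a Scrapy API.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def pdf_filename(url, default="download.pdf"):
    """Derive a local file name from a URL, ignoring query string and fragment."""
    path = urlsplit(url).path                  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))   # last path segment, percent-decoded
    return name or default                     # fall back when the path ends in "/"


name = pdf_filename("http://www.pwc.com/us/en/assets/report.pdf?download=1#page=2")
```

Inside the spider this would replace the `split('/')` line, e.g. `path = pdf_filename(response.url)`.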

Answer 1: score 22, accepted, posted 2016-03-21 16:04:02