
Why does scrapy miss some links?

I am scraping the website "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely; in total, 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy does not throw any warnings or errors. My spider is supposed to scrape every single link. I use CrawlSpider.

However, when I look at the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which "scrapy" reports as scraped/processed, I see that it contains more PDF documents, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reason for them not to be scraped. There is no dynamic/JavaScript content on this page, and it is not forbidden in "http://www.airproducts.com/robots.txt".

Do you have any idea why this can happen? Is it maybe because the "files" folder is not in "http://www.accell-group.com/sitemap.xml"?

Thanks in advance!

My code:

import re
import logging

from urlparse import urlparse  # on Python 3: from urllib.parse import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

# PyscrappItem is the project's Item subclass; assumed to be importable from
# the project's items module.

logger = logging.getLogger(__name__)


class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"

    def __init__(self, *a, **kw):

        # Get the passed URL
        originalURL =  kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))

        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]

        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )

        # Get WARC writer        
        self.warcHandler = kw.get('warcHandler')

        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)


    def parse_start_url(self, response):
        if "redirect_urls" in response.request.meta:
            original_url = response.request.meta["redirect_urls"][0]
            if not self.in_redirect.get(original_url):
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)

    def parse_data(self, response):

        """This function extracts data from the page."""

        self.warcHandler.write_response(response)

        pattern = self.pattern

        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer') 
            or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE) 
            or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):

            logging.debug("This page gets processed = %(url)s", {'url': response.url})

            sel = Selector(response)

            item = PyscrappItem()
            item['url'] = response.url


            return item
        else:

            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request

    def ensure_not_null(self, value):
        # Helper referenced above; the body here is an assumption: it is
        # expected to turn a missing value into an empty string so the
        # regular-expression checks never receive None.
        return value if value else ""
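For reference, a minimal sketch of how a spider that takes these keyword arguments can be started programmatically; the WARC handler below is a trivial stand-in for the real writer, and the user agent string is only a placeholder:

from scrapy.crawler import CrawlerProcess


class DummyWarcHandler(object):
    """Stand-in for the real WARC writer passed in as 'warcHandler'."""
    def write_response(self, response):
        pass  # a real handler would persist the response as a WARC record


process = CrawlerProcess({'USER_AGENT': 'pyscrapp (+http://www.example.com)'})
process.crawl(PyscrappSpider,
              originalURL='www.accell-group.com',
              warcHandler=DummyWarcHandler())
process.start()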

Remove, or expand appropriately, your "allowed_domains" variable and you should be fine. By default, all the URLs the spider follows are restricted by allowed_domains.
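For example, a minimal sketch (the domain value is taken from the question) that keeps the offsite filter but does not cut off other hosts under the same registered domain:

# Allow the registered domain rather than the exact hostname, so that links on
# sibling subdomains are not dropped by Scrapy's OffsiteMiddleware.
self.allowed_domains = ['accell-group.com']

# Or, while debugging, disable the offsite restriction entirely by leaving it empty:
# self.allowed_domains = []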

EDIT: This case concerns PDFs in particular. PDFs are explicitly excluded as extensions by the default value of deny_extensions, which is IGNORED_EXTENSIONS (defined in scrapy.linkextractors).
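A quick way to confirm this from a Python shell (assuming Scrapy 1.0 or later, where the constant lives in scrapy.linkextractors):

from scrapy.linkextractors import IGNORED_EXTENSIONS

# 'pdf' is in the default ignore list, so LinkExtractor silently drops links
# ending in .pdf unless deny_extensions is overridden.
print('pdf' in IGNORED_EXTENSIONS)  # True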

To allow your spider to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by setting the value of deny_extensions explicitly:

from scrapy.linkextractors import IGNORED_EXTENSIONS

self.rules = (
    Rule(
        LinkExtractor(
            deny=[r"accessdenied"],
            deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf'])
        ),
        callback="parse_data",
        follow=True
    ),
)
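Note that the entries in IGNORED_EXTENSIONS are bare extensions without a leading dot, which is why the value removed above is 'pdf' rather than '.pdf'.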

So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see, it just opens the door to further questions, such as "How do I handle those PDFs?", but I guess that is the subject of another question.
