
How to extract data for next page using Scrapy

So I have written a script with two functions:

parse:

  • extracts URLs from the main URL and sends them to parse_city() to scrape each URL's details
  • once this is done, parse() extracts the next page and calls itself to repeat the step above

parse_city:

  • extracts the details from each URL.

Page one is scraped fine with this logic, but the next pages don't seem to be making it over to parse_city().

Here is the dummy code:

import scrapy
from bs4 import BeautifulSoup as bs
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TjSpider(scrapy.Spider):
    global count
    count = 0
    name = 'TJ'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/xyz']
    def parse(self, response):
        urls = response.xpath('//h2/a/@href').getall()
        for url in urls:
            print(url)
            base_url = "http://example.com"
            yield Request(base_url+url, callback=self.parse_city)
        try:
            next_page = response.xpath('//div[@class="fr"]/em[@class="active"]/following-sibling::em[1]/a/@href').extract()
            next_p="abc.com"+next_page[0]
            if next_p:
                yield Request(next_p,callback=self.parse)
        except Exception as e:
            print("Pages over") 

    def parse_city(self, response):
        global count
        #scrape_details
        title = response.xpath("<xpath to title>").extract()
        yield {
           'title' = title
        }

It prints the extracted URLs for each page, but it doesn't go into parse_city() for the next pages. I am new to Scrapy and don't understand what's going wrong.

OUTPUTS:

2020-11-03 19:32:19 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: TJscrape)
2020-11-03 19:32:19 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Jul 17 2020, 12:50:27) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Linux-5.3.0-28-generic-x86_64-with-Ubuntu-18.04-bionic
2020-11-03 19:32:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-11-03 19:32:19 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_MAX_DELAY': 3,
 'AUTOTHROTTLE_START_DELAY': 1,
 'BOT_NAME': 'TJscrape',
 'DOWNLOAD_DELAY': 2,
 'NEWSPIDER_MODULE': 'TJscrape.spiders',
 'SPIDER_MODULES': ['TJscrape.spiders']}
2020-11-03 19:32:19 [scrapy.extensions.telnet] INFO: Telnet Password: 5b735759a5050862
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-03 19:32:19 [scrapy.core.engine] INFO: Spider opened
2020-11-03 19:32:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-03 19:32:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
{'downloader/request_bytes': 143534,
'downloader/request_count': 384,
'downloader/request_method_count/GET': 384,
'downloader/response_bytes': 10086468,
'downloader/response_count': 384,
'downloader/response_status_count/200': 192,
'downloader/response_status_count/301': 192,
'elapsed_time_seconds': 939.533443,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 11, 3, 8, 42, 2, 246351),
'item_scraped_count': 50,
'log_count/DEBUG': 435,
'log_count/INFO': 25,
'memusage/max': 134905856,
'memusage/startup': 57491456,
'offsite/domains': 1,
'offsite/filtered': 7021,
'request_depth_max': 142,
'response_received_count': 192,
'scheduler/dequeued': 384,
'scheduler/dequeued/memory': 384,
'scheduler/enqueued': 384,
'scheduler/enqueued/memory': 384,
'start_time': datetime.datetime(2020, 11, 3, 8, 26, 22, 712908)}

You have a syntax error in parse_city: inside a dict literal you use a colon, not an assignment, so it should be:

yield {
   'title': title
}
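For context, a minimal sketch of how the corrected method could sit inside the spider class (the title XPath is still the placeholder from the question):

    def parse_city(self, response):
        # scrape the details of a single listing page
        title = response.xpath("<xpath to title>").extract()
        yield {
            'title': title  # colon, not '=', inside a dict literal
        }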

UPDATE: You also have a lot of offsite requests being filtered (offsite/filtered: 7021 in your stats). You have:

allowed_domains = ['example.com']

but you are trying to fetch next_page from abc.com, so those requests are dropped by the OffsiteMiddleware.
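One way to fix this, assuming the pagination links are relative paths on example.com just like the listing links, is to build the next-page URL with response.urljoin instead of hard-coding a different domain:

        next_page = response.xpath('//div[@class="fr"]/em[@class="active"]/following-sibling::em[1]/a/@href').get()
        if next_page:
            # urljoin resolves the relative href against the current page's URL,
            # keeping the request inside allowed_domains
            yield Request(response.urljoin(next_page), callback=self.parse)

If the next pages really do live on abc.com, add that domain to allowed_domains instead, e.g. allowed_domains = ['example.com', 'abc.com'].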
