I have written a Scrapy spider with two callback functions, parse() and parse_city(). Page one is extracted fine by this logic, but the next pages never seem to hand off to parse_city(). Here is the dummy code:
import scrapy
from bs4 import BeautifulSoup as bs
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TjSpider(scrapy.Spider):
    global count
    count = 0
    name = 'TJ'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/xyz']

    def parse(self, response):
        urls = response.xpath('//h2/a/@href').getall()
        for url in urls:
            print(url)
            base_url = "http://example.com"
            yield Request(base_url + url, callback=self.parse_city)
        try:
            next_page = response.xpath('//div[@class="fr"]/em[@class="active"]/following-sibling::em[1]/a/@href').extract()
            next_p = "abc.com" + next_page[0]
            if next_p:
                yield Request(next_p, callback=self.parse)
        except Exception as e:
            print("Pages over")

    def parse_city(self, response):
        global count
        # scrape details
        title = response.xpath("<xpath to title>").extract()
        yield {
            'title' = title
        }
It prints the extracted URLs for each page, but parse_city() is never reached for the next pages. I am new to Scrapy and don't understand what's going wrong.
OUTPUT:
2020-11-03 19:32:19 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: TJscrape)
2020-11-03 19:32:19 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Jul 17 2020, 12:50:27) - [GCC 8.4.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Linux-5.3.0-28-generic-x86_64-with-Ubuntu-18.04-bionic
2020-11-03 19:32:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-11-03 19:32:19 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_MAX_DELAY': 3,
'AUTOTHROTTLE_START_DELAY': 1,
'BOT_NAME': 'TJscrape',
'DOWNLOAD_DELAY': 2,
'NEWSPIDER_MODULE': 'TJscrape.spiders',
'SPIDER_MODULES': ['TJscrape.spiders']}
2020-11-03 19:32:19 [scrapy.extensions.telnet] INFO: Telnet Password: 5b735759a5050862
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-03 19:32:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-03 19:32:19 [scrapy.core.engine] INFO: Spider opened
2020-11-03 19:32:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-03 19:32:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
{'downloader/request_bytes': 143534,
'downloader/request_count': 384,
'downloader/request_method_count/GET': 384,
'downloader/response_bytes': 10086468,
'downloader/response_count': 384,
'downloader/response_status_count/200': 192,
'downloader/response_status_count/301': 192,
'elapsed_time_seconds': 939.533443,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 11, 3, 8, 42, 2, 246351),
'item_scraped_count': 50,
'log_count/DEBUG': 435,
'log_count/INFO': 25,
'memusage/max': 134905856,
'memusage/startup': 57491456,
'offsite/domains': 1,
'offsite/filtered': 7021,
'request_depth_max': 142,
'response_received_count': 192,
'scheduler/dequeued': 384,
'scheduler/dequeued/memory': 384,
'scheduler/enqueued': 384,
'scheduler/enqueued/memory': 384,
'start_time': datetime.datetime(2020, 11, 3, 8, 26, 22, 712908)}
You have a syntax error in parse_city — the yielded dict must use a colon, not an equals sign:

    yield {
        'title': title
    }
UPDATE: You have a lot of offsite requests filtered (offsite/filtered: 7021 in your stats). You have

    allowed_domains = ['example.com']

but you are building the next_page request on abc.com, so Scrapy's OffsiteMiddleware drops those requests before they are ever downloaded.
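One way to avoid this (a sketch, with hypothetical URLs) is to resolve the extracted href against the page it came from instead of hard-coding a different host. Inside the spider you would use response.urljoin(next_page[0]); the same resolution can be shown with the standard library's urljoin:

```python
from urllib.parse import urljoin

# Hypothetical values standing in for response.url and the extracted @href.
current_page = "http://example.com/xyz?page=1"
next_href = "/xyz?page=2"

# urljoin resolves the relative href against the page it was scraped from,
# so the resulting request stays on example.com and passes allowed_domains.
next_url = urljoin(current_page, next_href)
print(next_url)  # → http://example.com/xyz?page=2
```

With the URL built this way, yield Request(next_url, callback=self.parse) stays inside allowed_domains and is no longer filtered as offsite.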