You run Request of author url twice. First time to scrape list of authors. Second time to scrape current author details. Dumping Scrapy stats (in the end of logging) show "dupefilter/filtered" count. It means scrapy filtered duplicate URLs. Scraping will work if you remove "parse_content" function and write code like this:
def parse(self,response):
if 'tags' in response.meta:
author = {}
author['url'] = response.url
name = response.css(".people-name::text").extract()
join_date = response.css(".joined-time::text").extract()
following_no = response.css(".following-number::text").extract()
followed_no = response.css(".followed-number::text").extract_first()
first_onsale = response.css(".first-onsale-date::text").extract()
total_no = response.css(".total-number::text").extract()
comments = total_no[0]
onsale = total_no[1]
columns = total_no[2]
ebooks = total_no[3]
essays = total_no[4]
author['tags'] = response.meta['tags']
author['name'] = name
author['join_date'] = join_date
author['following_no'] = following_no
author['followed_no'] = followed_no
author['first_onsale'] = first_onsale
author['comments'] = comments
author['onsale'] = onsale
author['columns'] = columns
author['ebooks'] = ebooks
author['essays'] = essays
yield author
authors = response.css('section.following-agents ul.bd li.item')
for author in authors:
tags = author.css('div.author-tags::text').extract_first()
url = author.css('a.lnk-avatar::attr(href)').extract_first()
yield response.follow(url=url, callback=self.parse, meta={'tags': tags})
Be carefull. I removed some lines during testing. You need to use random agents in HTTP headers, request delay or proxy. I run collection and now I got "403 Forbidden" status code.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.