
scrapy not following links with no error

The URL below is used both to extract content and to be followed, but nothing happens after the content is extracted. I don't know why it is not being followed.

There seem to be no errors.

[screenshot omitted]

You run a Request for the author URL twice: first when scraping the list of authors, then again when scraping the current author's details. The Scrapy stats dumped at the end of the log show a "dupefilter/filtered" count, which means Scrapy filtered out the duplicate URLs. Scraping will work if you remove the "parse_content" function and write the code like this:

def parse(self, response):

    # Detail page: requests followed from the author list carry 'tags' in meta
    if 'tags' in response.meta:
        author = {}
        author['url'] = response.url

        name = response.css(".people-name::text").extract()
        join_date = response.css(".joined-time::text").extract()
        following_no = response.css(".following-number::text").extract()
        followed_no = response.css(".followed-number::text").extract_first()
        first_onsale = response.css(".first-onsale-date::text").extract()
        total_no = response.css(".total-number::text").extract()
        comments = total_no[0]
        onsale = total_no[1]
        columns = total_no[2]
        ebooks = total_no[3]
        essays = total_no[4]

        author['tags'] = response.meta['tags']
        author['name'] = name
        author['join_date'] = join_date
        author['following_no'] = following_no
        author['followed_no'] = followed_no
        author['first_onsale'] = first_onsale
        author['comments'] = comments
        author['onsale'] = onsale
        author['columns'] = columns
        author['ebooks'] = ebooks
        author['essays'] = essays

        yield author

    # List page: follow each author link back into this same callback,
    # passing the tags along in the request meta
    authors = response.css('section.following-agents ul.bd li.item')
    for author in authors:
        tags = author.css('div.author-tags::text').extract_first()
        url = author.css('a.lnk-avatar::attr(href)').extract_first()
        yield response.follow(url=url, callback=self.parse, meta={'tags': tags})
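
If you would rather keep a separate callback for the author detail page (as in the original parse_content idea), another option is to let the duplicate request through by passing dont_filter=True to response.follow. A minimal sketch, assuming the same selectors as above and a hypothetical parse_author callback, with only a couple of fields shown:

def parse(self, response):
    for author in response.css('section.following-agents ul.bd li.item'):
        tags = author.css('div.author-tags::text').extract_first()
        url = author.css('a.lnk-avatar::attr(href)').extract_first()
        # dont_filter=True tells the scheduler to skip the duplicate filter
        # for this request, so the author page is fetched even if its URL
        # was already requested once
        yield response.follow(url=url, callback=self.parse_author,
                              meta={'tags': tags}, dont_filter=True)

def parse_author(self, response):
    # hypothetical detail callback; reuse the selectors from the code above
    yield {
        'url': response.url,
        'tags': response.meta['tags'],
        'name': response.css(".people-name::text").extract(),
    }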

Be careful: I removed some lines during testing. You will need to use random user agents in the HTTP headers, a request delay, or a proxy. I ran the collection and now I get a "403 Forbidden" status code.
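
A minimal sketch of the delay and user-agent side, assuming a project named myproject and a hypothetical pool of user-agent strings (DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED and the downloader-middleware hook are standard Scrapy settings):

# settings.py -- slow down requests so the site is less likely to return 403
DOWNLOAD_DELAY = 2                # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True       # adapt the delay to the server's latency

DOWNLOADER_MIDDLEWARES = {
    # hypothetical module path; point it at wherever the middleware lives
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}

# middlewares.py -- pick a random User-Agent header for every outgoing request
import random

USER_AGENTS = [
    # hypothetical pool; replace with real browser UA strings
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

A proxy pool can be plugged in the same way through another downloader middleware.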
