scrapy not following links with no error

Question

The URL below is used both to extract content and to be followed, but nothing happens after the content is extracted. I don't know why it is not followed.

There seem to be no errors.

Answer 1

You request the author URL twice: the first time to scrape the list of authors, the second time to scrape the current author's details. The Scrapy stats dumped at the end of the log show a "dupefilter/filtered" count, which means Scrapy filtered out duplicate URLs. Scraping will work if you remove the "parse_content" function and write code like this:

def parse(self, response):
    # A request followed from the author list carries 'tags' in its meta,
    # so a response with 'tags' is an author detail page.
    if 'tags' in response.meta:
        author = {}
        author['url'] = response.url

        name = response.css(".people-name::text").extract()
        join_date = response.css(".joined-time::text").extract()
        following_no = response.css(".following-number::text").extract()
        followed_no = response.css(".followed-number::text").extract_first()
        first_onsale = response.css(".first-onsale-date::text").extract()
        # The five counters are expected in this order on the page;
        # indexing raises IndexError if fewer than five are present.
        total_no = response.css(".total-number::text").extract()
        comments = total_no[0]
        onsale = total_no[1]
        columns = total_no[2]
        ebooks = total_no[3]
        essays = total_no[4]

        author['tags'] = response.meta['tags']
        author['name'] = name
        author['join_date'] = join_date
        author['following_no'] = following_no
        author['followed_no'] = followed_no
        author['first_onsale'] = first_onsale
        author['comments'] = comments
        author['onsale'] = onsale
        author['columns'] = columns
        author['ebooks'] = ebooks
        author['essays'] = essays

        yield author

    # Follow every author link in the list, passing the tags along in meta
    # and reusing this same callback for the detail pages.
    authors = response.css('section.following-agents ul.bd li.item')
    for author in authors:
        tags = author.css('div.author-tags::text').extract_first()
        url = author.css('a.lnk-avatar::attr(href)').extract_first()
        yield response.follow(url=url, callback=self.parse, meta={'tags': tags})
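To see why a second request for the same URL disappears without any error, here is a minimal sketch of a "seen set" duplicate filter. It is a simplified model written for illustration, not Scrapy's actual implementation; the class name SeenFilter and the example URL are assumptions. The dont_filter flag mirrors the dont_filter argument that Scrapy requests accept to bypass the filter.

```python
# Simplified model of a duplicate filter (illustration only, not Scrapy's code).
class SeenFilter:
    def __init__(self):
        self.seen = set()   # URLs already scheduled
        self.filtered = 0   # plays the role of the "dupefilter/filtered" stat

    def allow(self, url, dont_filter=False):
        """Return True if the request should be scheduled."""
        if dont_filter:
            return True          # caller explicitly bypasses the filter
        if url in self.seen:
            self.filtered += 1   # dropped silently, no error is raised
            return False
        self.seen.add(url)
        return True


dupefilter = SeenFilter()
author_url = "https://example.com/author/1"  # hypothetical URL

first = dupefilter.allow(author_url)    # first request for the author URL
second = dupefilter.allow(author_url)   # second request for the same URL
forced = dupefilter.allow(author_url, dont_filter=True)

print(first, second, forced, dupefilter.filtered)
```

The second request is dropped and only the counter changes, which matches the symptom in the question: no error, just a non-zero "dupefilter/filtered" in the stats.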

Be careful: I removed some lines during testing. You need to use random user agents in the HTTP headers, a request delay, or a proxy. I ran the collection and now get a "403 Forbidden" status code.
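As a rough sketch of the delay and user-agent advice: the setting names below are standard Scrapy settings, but the values, the USER_AGENTS list and the random_headers helper are illustrative assumptions, not tested recommendations for this site.

```python
import random

# settings.py fragment (values are illustrative assumptions)
DOWNLOAD_DELAY = 2               # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the wait between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # adapt the delay to how fast the server responds

# Hypothetical pool of user-agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
]


def random_headers():
    """Build request headers with a randomly chosen user agent (hypothetical helper)."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

The headers could then be passed per request, for example response.follow(url=url, callback=self.parse, meta={'tags': tags}, headers=random_headers()). A proxy would need separate configuration and is not shown here.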
