
Python Scrapy Spider: Inconsistent results

I would love to know what you think about this. I have been researching for a few days now and I can't seem to find where I am going wrong. Any help would be highly appreciated.

I want to systematically crawl this URL: Question site, using its pagination to reach the rest of the pages.

My current code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from acer.items import AcerItem


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()

        for question in questions:
            item = AcerItem()
            item['title'] = question.xpath('//h1/text()').extract()
            item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

When I run the spider it doesn't throw any errors, but it outputs inconsistent results, sometimes scraping an article page twice. I am thinking it might be something to do with the selectors I have used, but I can't narrow it down any further. Any help with this, please?
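One thing worth noting about the code above: the XPaths inside the loop (//h1 and //div[@class="row-fluid"][2]//p) are absolute, so they are evaluated against the whole listing page on every iteration rather than against the individual question link, which by itself can produce the same title and body repeated for every item. A minimal sketch of the more usual pattern is shown below: follow each extracted href and parse the question page in its own callback. The question-page selectors are reused from the code above and are an assumption about that page's layout.

import scrapy

from acer.items import AcerItem


class QuestionSpider(scrapy.Spider):
    name = 'acerquestions'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    def parse(self, response):
        # Follow every question link found in the listing table.
        # Pagination links are not followed in this sketch.
        for href in response.xpath('//td[@class="word-break"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_question)

    def parse_question(self, response):
        # These XPaths now run against the question page itself, so each
        # yielded item describes exactly one question.
        item = AcerItem()
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['body'] = response.xpath('//div[@class="row-fluid"][2]//p/text()').extract()
        yield item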

Kevin, I had a similar but slightly different problem earlier today, where my CrawlSpider was visiting unwanted pages. Someone responded to my question with the suggestion of checking the link extractor, which is documented here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

I ended up reviewing my allow / deny arguments to focus the crawler onto specific subsets of pages. You can use regular expressions to express the relevant substrings of the links to allow (include) or deny (exclude). I tested the expressions using http://www.regexpal.com/
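For example, a pair of rules along these lines keeps the crawler on the listing pagination and the question pages only. The regular expressions here are placeholders, not studyacer.com's real URL scheme, and would need to be adapted:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FocusedAcerSpider(CrawlSpider):
    name = 'focusedacer'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        # Follow the listing pagination but don't scrape those pages.
        # Illustrative pattern only.
        Rule(LinkExtractor(allow=(r'/latest(\?page=\d+)?$',)), follow=True),
        # Scrape only pages whose URL looks like an individual question,
        # and skip anything under /user/ (again, illustrative patterns).
        Rule(LinkExtractor(allow=(r'/question/',), deny=(r'/user/',)),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        self.logger.info('scraping %s', response.url)

Because the first rule has no callback, pagination pages are only used for link discovery; only URLs matched by the second rule ever reach parse_item.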

I found this approach was sufficient to prevent duplicates, but if you're still seeing them, here is an article I was looking at earlier in the day on how to prevent duplicate URL crawling, although I have to say I didn't have to implement this fix myself:

Avoid Duplicate URL Crawling

https://stackoverflow.com/a/21344753/6582364
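If duplicates still slip through, one simple extra guard (not taken from the linked answer; the helper below is purely illustrative) is to remember which URLs have already produced an item and skip repeats in the callback, on top of Scrapy's built-in duplicate request filter:

class SeenUrlsMixin:
    """Remember URLs that have already yielded an item and skip repeats."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._seen_urls = set()

    def is_new_url(self, url):
        # True the first time a URL is passed in, False on every later call.
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        return True

A spider could then inherit from both this mixin and CrawlSpider, and begin parse_item with a check such as: if not self.is_new_url(response.url): return.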
