
Python Scrapy Spider: Inconsistent results

I would love to know what you think about this. I have been researching this for a few days now and I can't seem to find where I am going wrong. Any help will be highly appreciated.

I want to systematically crawl this URL: Question site, using the pagination to crawl the rest of the pages.

My current code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from acer.items import AcerItem


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions= Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()

        for question in questions:
            item= AcerItem()
            item['title']= question.xpath('//h1/text()').extract()
            item['body']= Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

When I run the spider it doesn't throw any errors, but it outputs inconsistent results, sometimes scraping an article page twice. I think it might be something to do with the selectors I have used, but I can't narrow it down any further. Any help with this, please?

kevin, I had a similar but slightly different problem earlier today, where my CrawlSpider was visiting unwanted pages. Someone responded to my question with the suggestion of checking the LinkExtractor, as described here: http://doc.scrapy.org/en/latest/topics/link-extractors.html

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

I ended up reviewing my allow/deny components to focus the crawler on specific subsets of pages. You can use regular expressions to express the relevant substrings of the links to allow (include) or deny (exclude). I tested the expressions using http://www.regexpal.com/
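
For illustration, here is a rough sketch of how allow/deny patterns could be plugged into this spider. The regex patterns below are hypothetical placeholders (the real ones depend on studyacer.com's actual URL structure), and the item is yielded as a plain dict rather than the project's AcerItem so the example stands alone:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FocusedAcerSpider(CrawlSpider):
    name = 'acercrawler_focused'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        # Follow pagination links only (hypothetical pattern).
        Rule(LinkExtractor(allow=(r'/latest(\?page=\d+)?$',)), follow=True),
        # Scrape question pages only, excluding user profiles (hypothetical patterns).
        Rule(LinkExtractor(allow=(r'/question/',), deny=(r'/user/',)),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Each question page is parsed once from its own response, so the
        # title and body selectors apply to that page only.
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'body': response.xpath('//div[@class="row-fluid"][2]//p/text()').extract(),
        }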

I found this approach was sufficient to prevent duplicates, but if you're still seeing them, here is an article I was looking at earlier in the day on how to prevent duplicates, although I have to say I didn't have to implement this fix:

Avoid Duplicate URL Crawling

https://stackoverflow.com/a/21344753/6582364
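
For reference, here is a minimal sketch of one common in-spider de-duplication pattern. It is an illustration, not necessarily what the linked answer implements: the class name and the URL normalisation (stripping the query string) are assumptions, and Scrapy's built-in duplicate filter already drops requests with identical URLs, so this only helps when the same article is reachable under slightly different URLs:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DedupAcerSpider(CrawlSpider):
    name = 'acercrawler_dedup'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, *args, **kwargs):
        super(DedupAcerSpider, self).__init__(*args, **kwargs)
        self.scraped_urls = set()  # pages whose items have already been yielded

    def parse_item(self, response):
        # Normalise the URL (here: drop the query string) so the same article
        # reached through different links is only scraped once.
        url = response.url.split('?')[0]
        if url in self.scraped_urls:
            return
        self.scraped_urls.add(url)
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'body': response.xpath('//div[@class="row-fluid"][2]//p/text()').extract(),
        }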
