Python Scrapy Spider: Inconsistent Results

Question

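I'd like to hear your thoughts on this. I've been researching it for a few days, but I can't seem to find where I'm going wrong. Any help would be much appreciated.

I want to systematically crawl this URL: the question site, using its pagination to crawl the remaining pages.

My current code: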

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from acer.items import AcerItem


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()

        for question in questions:
            item = AcerItem()
            item['title'] = question.xpath('//h1/text()').extract()
            item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

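When I run the spider it doesn't raise any errors, but it outputs inconsistent results. Sometimes it scrapes an article page twice. I think it may have something to do with the selectors I'm using, but I can't narrow it down any further. Can anyone help?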

Answer 1

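Kevin, I ran into a similar but slightly different problem earlier today: my crawler was visiting pages I didn't want. Someone answered my question with the suggestion to check the link extractor, as described here: http://doc.scrapy.org/en/latest/topics/link-extractors.html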

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

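In the end I reviewed my allow/deny arguments to focus the crawler on a specific subset of pages. You can use regular expressions that match the relevant substrings of the links you want to allow (include) or deny (exclude). I tested these expressions with http://www.regexpal.com/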
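As a rough illustration of the allow/deny idea, here is a minimal sketch that checks candidate URLs against regular expressions using only the standard library. The patterns and the URL paths (`/question/<id>`, `/user/`, `/login`) are assumptions, since the site's actual URL layout isn't shown in the question, and `would_follow` is a hypothetical helper that only approximates how the link extractor applies the patterns.

```python
import re

# Hypothetical patterns: adjust them to whatever the listing and question
# pages on the target site actually look like.
ALLOW = (r"/latest", r"/question/\d+")
DENY = (r"/login", r"/user/")


def would_follow(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny check: keep a URL if it matches at least
    one allow pattern (or allow is empty) and matches no deny pattern."""
    allowed = not allow or any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


# The same tuples could then be handed to the spider's rule, e.g.:
# Rule(LinkExtractor(allow=ALLOW, deny=DENY), callback='parse_item', follow=True)
```

Calling `would_follow("http://www.studyacer.com/latest?page=2")` on a few sample links before a crawl is a quick way to see which pages the rule would pick up, much like trying the expressions in an online regex tester.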

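I found this approach sufficient to prevent duplicates, but if you still see them, here is a post on preventing duplicates that I came across while researching this earlier in the day. I have to say I didn't need to implement this fix myself:

Avoid duplicate URL crawling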

https://stackoverflow.com/a/21344753/6582364
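If duplicates still show up, one simple option is to remember which pages have already been handled. The sketch below is only an illustration of that idea, not the method from the linked post: `canonical` and `SeenUrls` are hypothetical helpers, and the normalisation rule (drop the fragment and any trailing slash) is an assumption about how the duplicate URLs differ.

```python
from urllib.parse import urldefrag


def canonical(url):
    """Drop the #fragment and any trailing slash so trivially different
    spellings of one page compare equal (assumed normalisation rule)."""
    without_fragment, _ = urldefrag(url)
    return without_fragment.rstrip("/")


class SeenUrls:
    """Remembers which canonical URLs have already been handled."""

    def __init__(self):
        self._seen = set()

    def first_visit(self, url):
        """Return True the first time a page is seen, False afterwards."""
        key = canonical(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Inside a spider this could be used at the top of the callback, for example by returning early when `first_visit(response.url)` is false. Scrapy already filters requests for URLs it has seen by default, so a check like this should only matter when the same page is reachable under slightly different URLs.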

Solution 1
0 2016-08-08 17:04:09