Initialising a CrawlSpider in Scrapy
I have written a spider in Scrapy which is basically doing fine and does exactly what it is supposed to do. The problem is that I need to make some small changes to it, and I have tried several approaches without success (e.g. modifying the InitSpider). Here is what the script is supposed to do now:
http://www.example.de/index/search?method=simple
http://www.example.de/index/search?filter=homepage
So basically, all that needs to change is that one URL gets called in between the two. I would rather not rewrite the whole thing using a BaseSpider, so I am hoping somebody knows how to achieve this :)
If you need any additional information, please let me know. You can find the current script below.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from example.items import ExampleItem
from scrapy.contrib.loader.processor import TakeFirst
import re
import urllib

take_first = TakeFirst()

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.de"]

    start_url = "http://www.example.de/index/search?method=simple"
    start_urls = [start_url]

    rules = (
        # http://www.example.de/index/search?page=2
        # http://www.example.de/index/search?page=1&tab=direct
        Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*&tab=direct', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # fetch all company entries
        companies = hxs.select("//ul[contains(@class, 'directresults')]/li[contains(@id, 'entry')]")
        items = []
        for company in companies:
            item = ExampleItem()
            item['name'] = take_first(company.select(".//span[@class='fn']/text()").extract())
            item['address'] = company.select(".//p[@class='data track']/text()").extract()
            item['website'] = take_first(company.select(".//p[@class='customurl track']/a/@href").extract())
            # we try to fetch the number directly from the page (only works for premium entries)
            item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/text()").extract())
            if not item['telephone']:
                # if we cannot fetch the number it has been encoded on the client and hidden in the rel=""
                item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/@rel").extract())
            items.append(item)
        return items
```
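As a quick sanity check of the two `allow` patterns in the rules above, they can be tested against the example URLs from the comments using plain Python `re` (the same regexes, outside of Scrapy):

```python
import re

# The same regexes used in the SgmlLinkExtractor rules above
page_pattern = r'/index/search\?page=\d*$'
direct_pattern = r'/index/search\?page=\d*&tab=direct'

assert re.search(page_pattern, "http://www.example.de/index/search?page=2")
assert re.search(direct_pattern, "http://www.example.de/index/search?page=1&tab=direct")
# The first pattern is anchored with $, so it does NOT match the tabbed URL
assert not re.search(page_pattern, "http://www.example.de/index/search?page=1&tab=direct")
```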
EDIT
Here is my attempt with the InitSpider: https://gist.github.com/150b30eaa97e0518673a I got the idea from here: Crawling with an authenticated session in Scrapy
As you can see, it still inherits from CrawlSpider, but I made some changes to the core Scrapy files (not my favourite approach). I made CrawlSpider inherit from InitSpider instead of BaseSpider (source).
This works so far, but the spider just stops after the first page instead of picking up all the other pages.
Also, this approach seems completely unnecessary to me :)
OK, I found the solution myself, and it is actually much simpler than I initially thought :)
Here is the simplified script:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy import log
from scrapy.selector import HtmlXPathSelector
from example.items import ExampleItem
from scrapy.contrib.loader.processor import TakeFirst
import re
import urllib

take_first = TakeFirst()

class ExampleSpider(BaseSpider):
    name = "ExampleNew"
    allowed_domains = ["www.example.de"]

    start_page = "http://www.example.de/index/search?method=simple"
    direct_page = "http://www.example.de/index/search?page=1&tab=direct"
    filter_page = "http://www.example.de/index/search?filter=homepage"

    def start_requests(self):
        """This function is called before crawling starts."""
        return [Request(url=self.start_page, callback=self.request_direct_tab)]

    def request_direct_tab(self, response):
        return [Request(url=self.direct_page, callback=self.request_filter)]

    def request_filter(self, response):
        return [Request(url=self.filter_page, callback=self.parse_item)]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # fetch the items you need and yield them like this:
        # yield item
        # fetch the next pages to scrape
        for url in hxs.select("//div[@class='limiter']/a/@href").extract():
            absolute_url = "http://www.example.de" + url
            yield Request(absolute_url, callback=self.parse_item)
```
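One caveat about the `absolute_url` line above: plain string concatenation only works because the extracted hrefs happen to be root-relative. The standard library's `urljoin` (`urlparse.urljoin` on Python 2, `urllib.parse.urljoin` on Python 3) handles relative and already-absolute hrefs alike, so a slightly more robust sketch would be:

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = "http://www.example.de/index/search?filter=homepage"

# Root-relative href, as extracted by the spider above
print(urljoin(base, "/index/search?page=2"))
# -> http://www.example.de/index/search?page=2

# An already-absolute href passes through unchanged, where naive
# concatenation would produce a broken double-prefixed URL
print(urljoin(base, "http://www.example.de/index/search?page=3"))
# -> http://www.example.de/index/search?page=3
```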
As you can see, I am now using a BaseSpider and generating the new requests myself at the end. And at the beginning, I simply chain together all the different requests that need to be made before the crawling can start.
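The chaining idea itself is independent of Scrapy: each callback returns the next request, and the scheduler drains the queue, so the three pages are visited strictly in the declared order. A minimal simulation of that mechanism (where `fetch` is just a dummy stand-in for Scrapy's downloader):

```python
# Each step returns the next "request" as a (url, callback) pair,
# and a tiny scheduler loop drains them in order.

visited = []

def fetch(url):
    """Dummy stand-in for Scrapy's downloader; records the visit order."""
    visited.append(url)
    return "<html>...</html>"  # dummy response body

def start_requests():
    return [("http://www.example.de/index/search?method=simple", request_direct_tab)]

def request_direct_tab(response):
    return [("http://www.example.de/index/search?page=1&tab=direct", request_filter)]

def request_filter(response):
    return [("http://www.example.de/index/search?filter=homepage", parse_item)]

def parse_item(response):
    return []  # the real spider extracts items and follows pagination here

# Tiny scheduler loop
queue = list(start_requests())
while queue:
    url, callback = queue.pop(0)
    response = fetch(url)
    queue.extend(callback(response))

# visited now holds the three URLs in exactly the declared order
print(visited)
```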
I hope this helps somebody :) If you have questions, I will gladly answer them.