Scrapy crawl spider doesn't follow links
For this I used the crawl spider example from the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/spiders.html
I want to get links from a web page and follow them to parse a table with statistics, but somehow I don't see any links being grabbed and followed to the page that has the data. Here is my script:
from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class Basketspider(CrawlSpider):
    name = "basketsp"
    allowed_domains = ["euroleague.net"]
    start_urls = ["http://www.euroleague.net/main"]
    rules = (
        Rule(SgmlLinkExtractor(allow=("results/by-date?seasoncode=E2000")), follow=True),
        Rule(SgmlLinkExtractor(allow=("showgame?gamecode=165&seasoncode=E2000#!boxscore")), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        sel = HtmlXPathSelector(response)
        items = []
        item = BasketbaseItem()
        item['date'] = sel.select('//div[@class="gs-dates"]/text()').extract()  # Game date
        item['time'] = sel.select('//div[@class="gs-dates"]/span[@class="GameScoreTimeContainer"]/text()').extract()  # Game time
        item['stage'] = sel.select('//div[@class="gs-dates"]/text()').extract()  # Stage of tournament
        item['home'] = sel.select('//div[@class="gs-teams"]/a[@class="localClub"]/text()').extract()  # Home team
        item['guest'] = sel.select('//div[@class="gs-teams"]/a[@class="roadClub"]/text()').extract()  # Visitor team
        item['referees'] = sel.select('//span[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_lblReferees"]/text()').extract()  # Referees
        item['attendance'] = sel.select('//span[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_lblAudience"]/text()').extract()
        item['fst'] = sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[2]/td[2][@class="AlternatingColumn"]/text()').extract() + sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[3]/td[2][@class="AlternatingColumn"]/text()').extract()
        item['snd'] = sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[2]/td[3][@class="NormalColumn"]/text()').extract() + sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[3]/td[3][@class="NormalColumn"]/text()').extract()
        item['trd'] = sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[2]/td[4][@class="AlternatingColumn"]/text()').extract() + sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[3]/td[4][@class="AlternatingColumn"]/text()').extract()
        item['tth'] = sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[2]/td[5][@class="NormalColumn"]/text()').extract() + sel.select('//table[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_PartialsStatsByQuarter_dgPartials"]//tr[3]/td[5][@class="NormalColumn"]/text()').extract()
        item['xt1'] = sel.select('//div[@class="gs-dates"]/text()').extract()
        item['xt2'] = sel.select('//div[@class="gs-dates"]/text()').extract()
        item['xt3'] = sel.select('//div[@class="gs-dates"]/text()').extract()
        item['xt4'] = sel.select('//div[@class="gs-dates"]/text()').extract()
        item['game_id'] = sel.select('//span[@id="ctl00_ctl00_ctl00_ctl00_maincontainer_maincenter_contentpane_boxscorepane_ctl00_lblReferees"]/text()').extract()  # Game ID construct
        item['arena'] = sel.select('//div[@class="gs-dates"]/text()').extract()  # Arena
        item['result'] = sel.select('//span[@class="score"]/text()').extract()  # Result
        item['league'] = sel.select('//div[@class="gs-dates"]/text()').extract()  # League
        print item['date'], item['time'], item['stage'], item['home'], item['guest'], item['referees'], item['attendance'], item['fst'], item['snd'], item['trd'], item['tth'], item['result']
        items.append(item)
And here is the output from the terminal:
scrapy crawl basketsp
2013-11-17 01:40:15+0200 [scrapy] INFO: Scrapy 0.16.2 started (bot: basketbase)
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Enabled item pipelines:
2013-11-17 01:40:15+0200 [basketsp] INFO: Spider opened
2013-11-17 01:40:15+0200 [basketsp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-17 01:40:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-17 01:40:15+0200 [basketsp] DEBUG: Crawled (200) <GET http://www.euroleague.net/main> (referer: None)
2013-11-17 01:40:15+0200 [basketsp] INFO: Closing spider (finished)
2013-11-17 01:40:15+0200 [basketsp] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 228,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 9018,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 11, 16, 23, 40, 15, 496752),
'log_count/DEBUG': 7,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 11, 16, 23, 40, 15, 229125)}
2013-11-17 01:40:15+0200 [basketsp] INFO: Spider closed (finished)
What am I doing wrong here? Any ideas would be a great help.
I tried leaving SgmlLinkExtractor() empty so that all links would be followed, but I get the same result. There's no indication that the crawl spider works at all.
I'm running Scrapy version 0.16.2 on Python 2.7.2+.
Scrapy is misinterpreting the content type of the start URL.
You can verify this using scrapy shell:
$ scrapy shell 'http://www.euroleague.net/main'
2013-11-18 16:39:26+0900 [scrapy] INFO: Scrapy 0.21.0 started (bot: scrapybot)
...
AttributeError: 'Response' object has no attribute 'body_as_unicode'
See my previous answer about the missing body_as_unicode attribute. I notice that the server does not set any Content-Type header.
CrawlSpider ignores non-HTML responses, so the response is not processed and no links are followed.
I would suggest opening an issue on GitHub, as I think Scrapy should be able to handle this case transparently.
As a workaround, you could override the CrawlSpider parse method, create an HtmlResponse from the response object passed in, and pass that to the superclass parse method.
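A minimal sketch of that workaround, written against the Scrapy 0.16 API used in the question (the scrapy.contrib import path and this use of parse are specific to that era and differ in later releases):

```python
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    # name, allowed_domains, start_urls and rules as in the question ...

    def parse(self, response):
        # The server omits the Content-Type header, so Scrapy builds a plain
        # Response object. Re-wrap it as an HtmlResponse so that CrawlSpider's
        # link extraction runs, then hand it to the superclass parse method.
        if not isinstance(response, HtmlResponse):
            response = HtmlResponse(url=response.url, status=response.status,
                                    headers=response.headers, body=response.body,
                                    request=response.request)
        return super(Basketspider, self).parse(response)
```

This leaves the rules and callbacks untouched; only the response type seen by CrawlSpider changes.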
Prepend "www." to the allowed domains.
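Applied to the spider in the question, that suggestion amounts to the following one-line change (whether OffsiteMiddleware is actually filtering these requests is an assumption; in many Scrapy versions a bare domain already matches its subdomains):

```python
allowed_domains = ["www.euroleague.net"]
```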