Scrapy crawler is running, but not following any links (craigslist)
For some reason my crawler is only crawling a couple of pages. It should at least follow every URL on the start page. Also, this is craigslist, and I'm not sure whether they are known for blocking crawlers. Any idea what's going on?

Here is the output:
2012-07-01 15:02:56-0400 [craigslist] INFO: Spider opened
2012-07-01 15:02:56-0400 [craigslist] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-01 15:02:56-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6026
2012-07-01 15:02:56-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6083
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Crawled (200) <GET http://boston.craigslist.org/search/fua?query=chest+of+drawers> (referer: None)
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Crawled (200) <GET http://boston.craigslist.org/fua/> (referer: None)
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Filtered offsite request to 'boston.craigslist.org': <GET http://boston.craigslist.org/sob/fud/3112540401.html>
2012-07-01 15:02:57-0400 [craigslist] INFO: Closing spider (finished)
Here is the code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from craigslist.items import CraigslistItem
from scrapy.http import Request

class BostonCragistlistSpider(CrawlSpider):
    name = 'craigslist'
    allowed_domains = ['http://boston.craigslist.org']
    start_urls = ['http://boston.craigslist.org/search/fua?query=chest+of+drawers']

    rules = (
        # looking for links like:
        # http://boston.craigslist.org/sob/fud/3111565340.html
        # http://boston.craigslist.org/gbs/fuo/3112103005.html
        Rule(SgmlLinkExtractor(allow=r'\/[a-z]{3}\/[a-z]{3}\/.*\.html'),
             callback='get_image', follow=True),
        Rule(SgmlLinkExtractor(allow=r'\/search\/fua\?query=\.*'),
             callback='extract_links', follow=True),
    )

    def extract_links(self, response):
        print 'extracting links'
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//p[@class="row"]//a/@href').extract()
        for link in links:
            yield Request(link, callback=self.get_image)

    def get_image(self, response):
        print 'parsing'
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img//@src').extract()
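As an aside, the two `allow` patterns can be sanity-checked against sample URLs with plain `re` (a standalone sketch; note that `\.*` in the second pattern matches zero or more literal dots, so it still matches the search URL, though `.*` was probably intended):

```python
import re

# The two `allow` regexes from the spider's rules.
listing_pattern = re.compile(r'\/[a-z]{3}\/[a-z]{3}\/.*\.html')
search_pattern = re.compile(r'\/search\/fua\?query=\.*')

listing_url = 'http://boston.craigslist.org/sob/fud/3112540401.html'
search_url = 'http://boston.craigslist.org/search/fua?query=chest+of+drawers'

print(bool(listing_pattern.search(listing_url)))  # True
print(bool(search_pattern.search(search_url)))    # True
```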
Any thoughts would be greatly appreciated!
allowed_domains needs to contain domain names, not URLs. Change it to:
allowed_domains = ['boston.craigslist.org']
You can see from the log that the requests are being filtered out by the offsite middleware (the component that drops URLs outside allowed_domains):
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Filtered offsite request to 'boston.craigslist.org': <GET http://boston.craigslist.org/sob/fud/3112540401.html>
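A rough sketch of the hostname check the offsite middleware performs (simplified; `is_offsite` is a hypothetical helper, not Scrapy API, and it uses Python 3's `urllib.parse` for brevity) shows why the URL form filters everything while the bare domain does not:

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of the offsite check: the request's hostname
    # must equal an allowed domain or be a subdomain of one.
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

url = 'http://boston.craigslist.org/sob/fud/3112540401.html'

# With a URL mistakenly used as the "domain", nothing ever matches:
print(is_offsite(url, ['http://boston.craigslist.org']))  # True -> filtered

# With a bare domain name, the request passes through:
print(is_offsite(url, ['boston.craigslist.org']))         # False -> crawled
```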