Scrapy Spider not Following Links
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in

scrapy shell http://www.nytimes.com

it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
I have found the solution to my problem. I was doing two things wrong:

1. I needed CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.

2. With CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality.