I have a loop that builds the links I want to scrape:
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []
# Generate one link per day, from start_date through today
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
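If I print "links", it works fine and I get the URLs I want.
I also have a spider that scrapes the site correctly when I enter a URL by hand. Now I am trying to pass it the "links" variable, which holds the URLs I want to scrape, as shown below, but I get "undefined variable".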
class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']
    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                            args={
                                'lua_source': self.script
                            })

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
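How do I pass the generated links to SplashRequest(url=links ...)?
Thank you very much for your help. I am still very new to this and taking small steps, most of them backwards...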
From my comment above (I'm not quite sure this works, because I'm unfamiliar with Scrapy): the obvious problem is that nothing in the RpresultSpider class refers to a links variable that is in scope. Putting the loop that generates the URLs inside the function should solve this.
from datetime import date, timedelta

import scrapy
from scrapy_splash import SplashRequest


class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']
    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        start_date = date(2020, 1, 1)
        end_date = date.today()
        crawl_date = start_date
        base_url = "https://www.racingpost.com/results/"
        links = []
        # Generate the links
        while crawl_date <= end_date:
            links.append(base_url + str(crawl_date))
            crawl_date += timedelta(days=1)
        # A request takes a single URL string, so yield one request per link
        # instead of passing the whole list
        for link in links:
            yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                args={'lua_source': self.script})

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
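If it helps, the date loop can also be pulled out into a small generator, so the URLs are produced one at a time rather than collected in a list first. This is only a sketch: the name daily_result_urls and the BASE_URL constant are made up for illustration, and it uses nothing beyond the standard library, so it can be tried without Scrapy or Splash installed.

```python
from datetime import date, timedelta
from typing import Iterator

# Base URL taken from the question; the constant name is hypothetical.
BASE_URL = "https://www.racingpost.com/results/"


def daily_result_urls(start: date, end: date, base_url: str = BASE_URL) -> Iterator[str]:
    """Yield one results URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        # isoformat() gives YYYY-MM-DD, the same text as str(current)
        yield base_url + current.isoformat()
        current += timedelta(days=1)


if __name__ == "__main__":
    # Print the URLs for the first three days of 2020
    for url in daily_result_urls(date(2020, 1, 1), date(2020, 1, 3)):
        print(url)
```

Inside start_requests you would then loop over daily_result_urls(date(2020, 1, 1), date.today()) and yield one SplashRequest per URL, since each request takes a single URL string rather than a list.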