Python Scrapy - multiple variables in start_url
I want to make my start_url a bit more dynamic than it currently is, but my adjusted code doesn't seem to work.
To make it more dynamic, I added two more variables (month and day) and switched from start_urls to the start_requests method. However, the scraper now returns zero items:
import scrapy

class SuhbaSpider(scrapy.Spider):
    name = "suhbaDate"
    # old working line of code
    # start_urls = ["http://saltanat.org/videos.php?date={yyyy}-06-15".format(yyyy=yyyy) for yyyy in range(2013,2020)]

    # new block of code (replaced start_urls with start_requests), not working
    def start_requests(self):
        for yyyy in range(2013,2020):
            for mm in range(12,12):
                for dd in range(14,15):
                    url = "http://saltanat.org/videos.php?date={yyyy}-{mm}-{dd}".format(yyyy=yyyy,mm=mm,dd=dd)
                    yield Request(url, meta={'start_url':url}, callback=self.parse)
                    print(yyyy,mm,dd,url)

    def parse(self, response):
        for video in response.xpath("//tr[@class='video-doclet-row']"):
            item = dict()
            item["video"] = video.xpath(".//span[@class='download make-cursor']/a/@href").extract_first()
            videoid = video.xpath(".//span[@class='media-info make-cursor']/@onclick").extract_first()
            url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
            request = scrapy.Request(url, callback=self.parse_transcript)
            request.meta['item'] = item
            yield request

    def parse_transcript(self, response):
        item = response.meta['item']
        item["transcript"] = response.xpath("//a[contains(@href,'english')]/@href").extract_first()
        yield item
Any assistance will be appreciated.
So there were a couple of problems with the above code:
The traceback output showed that the actual problem was NameError: global name 'Request' is not defined. This is not a bug in Scrapy: the script calls Request without ever importing that name.
The URL in question needed a leading zero for the mm and dd variables.
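The leading-zero point can be checked in plain Python without running the spider. The snippet below is only a sketch: the sample date is made up, and the `:02d` format spec is shown as an alternative way to pad, not what the original code used. It also shows that `range(12,12)`, as posted in the question, is empty because the stop value is excluded, so that month loop would not run at all.

```python
# range(start, stop) excludes stop, so range(12, 12) yields nothing.
months_in_question = list(range(12, 12))
december_only = list(range(12, 13))

# Without padding, single-digit values give a date like 2013-6-5.
unpadded = "{yyyy}-{mm}-{dd}".format(yyyy=2013, mm=6, dd=5)

# The :02d format spec pads an integer to two digits with a leading zero.
padded = "{yyyy}-{mm:02d}-{dd:02d}".format(yyyy=2013, mm=6, dd=5)

print(months_in_question, december_only, unpadded, padded)
```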
The solution
Included this line at the top of the script: from scrapy.http.request import Request (scrapy.Request, which parse already uses, would also work without the extra import).
Rewrote the start_requests loop with itertools and zfill:
    # Needs "import itertools" and the Request import at the top of the script.
    def start_requests(self):
        for yyyy,mm,dd in itertools.product(range(2013,2020),range(6,7),range(14,22)):
            # zero-pad month and day to two digits
            mm = str(mm).zfill(2)
            dd = str(dd).zfill(2)
            url = "http://saltanat.org/videos.php?date={0}-{1}-{2}".format(yyyy,mm,dd)
            yield Request(url, meta={'start_url':url}, callback=self.parse)
* Ignore the actual dates; those are for testing. *
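If the ranges are later widened to whole months, combinations such as 31 June would produce URLs for dates that do not exist. One possible way around that, sketched here with a hypothetical helper named `build_urls` (not part of the original spider), is to let `datetime.date` validate each combination and format it. The sample ranges are arbitrary.

```python
import itertools
from datetime import date

BASE_URL = "http://saltanat.org/videos.php?date={}"

def build_urls(years, months, days):
    """Return one URL per valid calendar date in the given ranges.

    Hypothetical helper for illustration only.
    """
    urls = []
    for yyyy, mm, dd in itertools.product(years, months, days):
        try:
            # date() raises ValueError for impossible dates such as 31 June.
            day = date(yyyy, mm, dd)
        except ValueError:
            continue
        # isoformat() returns YYYY-MM-DD with zero-padded month and day.
        urls.append(BASE_URL.format(day.isoformat()))
    return urls

urls = build_urls(range(2013, 2015), range(6, 7), range(30, 32))
print(urls)
```

In the spider, start_requests could then loop over the returned list and yield a Request for each URL, exactly as in the loop above.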