Python Scrapy - multiple variables in start_url

I want to make my start_url a bit more dynamic than it currently is, but my adjusted code doesn't seem to work.

To make it more dynamic, I added two more variables (month and day) and switched from start_urls to the start_requests method. However, the scraper now returns zero items:

import scrapy

class SuhbaSpider(scrapy.Spider):
    name = "suhbaDate"
# old working line of code 
#    start_urls = ["http://saltanat.org/videos.php?date={yyyy}-06-15".format(yyyy=yyyy) for yyyy in range(2013,2020)]

# new block of code (replaced start_urls with start_requests), not working
    def start_requests(self):
        for yyyy in range(2013,2020):
            for mm in range(12,12):
                for dd in range(14,15):
                    url = "http://saltanat.org/videos.php?date={yyyy}-{mm}-{dd}".format(yyyy=yyyy,mm=mm,dd=dd) 
                    yield Request(url, meta={'start_url':url}, callback=self.parse)
                    print(yyyy,mm,dd,url)

    def parse(self, response):
        for video in response.xpath("//tr[@class='video-doclet-row']"):
            item = dict()
            item["video"] = video.xpath(".//span[@class='download make-cursor']/a/@href").extract_first()

            videoid = video.xpath(".//span[@class='media-info make-cursor']/@onclick").extract_first()
            url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
            request = scrapy.Request(url, callback=self.parse_transcript)
            request.meta['item'] = item
            yield request

    def parse_transcript(self, response):
        item = response.meta['item']
        item["transcript"] = response.xpath("//a[contains(@href,'english')]/@href").extract_first()
        yield item

Any assistance will be appreciated.

There were a couple of problems with the above code:

  1. The Traceback output showed that the actual problem was: NameError: global name 'Request' is not defined. Request was being called without having been imported.

  2. The url in question needed a leading zero for the mm and dd values (for example 06 rather than 6).
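
The padding issue can be checked in plain Python without running the spider. The sketch below is only an illustration: the helper name date_url is hypothetical, and the :02d format spec is one of several ways to get the zero padding. It also shows that range(12,12), as written in the question's loop, is empty, so with those exact bounds the loop body would never run.

```python
# Hypothetical helper for illustration, not part of the original spider.
BASE = "http://saltanat.org/videos.php?date="

def date_url(yyyy, mm, dd):
    # :02d pads month and day to two digits, e.g. 6 -> "06"
    return "{0}{1}-{2:02d}-{3:02d}".format(BASE, yyyy, mm, dd)

# Without padding, the month is rendered as a single digit.
unpadded = "{yyyy}-{mm}-{dd}".format(yyyy=2013, mm=6, dd=15)
padded = date_url(2013, 6, 15)

# range(start, stop) excludes stop, so range(12, 12) yields nothing.
empty_months = list(range(12, 12))
december_only = list(range(12, 13))

print(unpadded)
print(padded)
print(empty_months, december_only)
```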

The solution

  1. Included this line at the top of the script: from scrapy.http.request import Request

  2. Rewrote the start_requests loop with itertools and zfill:

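import itertools
from scrapy.http.request import Request

# method of the SuhbaSpider class
def start_requests(self):
    # one (year, month, day) tuple per combination; range() excludes its end value
    for yyyy,mm,dd in itertools.product(range(2013,2020),range(6,7),range(14,22)):
        # zero-pad month and day to two digits, e.g. 6 -> "06"
        mm = str(mm).zfill(2)
        dd = str(dd).zfill(2)
        url = "http://saltanat.org/videos.php?date={0}-{1}-{2}".format(yyyy,mm,dd)
        yield Request(url, meta={'start_url':url}, callback=self.parse)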

* ignore the actual dates, those are for testing *
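
If the test ranges are later widened to whole months or years, a plain product of ranges will also produce day numbers that do not exist in every month (for example June 31). One possible variation, sketched here on the assumption that the site expects dates as YYYY-MM-DD, is to step through real calendar dates with datetime and let strftime handle the padding. The names date_urls, first and last are made up for this example.

```python
from datetime import date, timedelta

BASE = "http://saltanat.org/videos.php?date="

def date_urls(first, last):
    """Yield one URL per calendar day from first to last, inclusive."""
    day = first
    while day <= last:
        # %Y-%m-%d always produces a zero-padded month and day
        yield BASE + day.strftime("%Y-%m-%d")
        day += timedelta(days=1)

# Example: a short span that crosses a month boundary.
urls = list(date_urls(date(2019, 6, 29), date(2019, 7, 2)))
for u in urls:
    print(u)
```

Inside start_requests, each generated url could then be passed to Request(url, meta={'start_url':url}, callback=self.parse) in the same way as in the loop above.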
