Python Scrapy - multiple variables in start_url

Question

I want to make my start_url a bit more dynamic than it currently is, but my adjusted code doesn't seem to work.

To make it more dynamic, I've added two more variables (month and day) and switched from start_urls to the start_requests method. However, the scraper now returns zero items:

import scrapy

class SuhbaSpider(scrapy.Spider):
    name = "suhbaDate"
# old working line of code 
#    start_urls = ["http://saltanat.org/videos.php?date={yyyy}-06-15".format(yyyy=yyyy) for yyyy in range(2013,2020)]

# new block of code (replaced start_urls with start_requests), not working
    def start_requests(self):
        for yyyy in range(2013,2020):
            for mm in range(12,12):
                for dd in range(14,15):
                    url = "http://saltanat.org/videos.php?date={yyyy}-{mm}-{dd}".format(yyyy=yyyy,mm=mm,dd=dd) 
                    yield Request(url, meta={'start_url':url}, callback=self.parse)
                    print(yyyy,mm,dd,url)

    def parse(self, response):
        for video in response.xpath("//tr[@class='video-doclet-row']"):
            item = dict()
            item["video"] = video.xpath(".//span[@class='download make-cursor']/a/@href").extract_first()

            videoid = video.xpath(".//span[@class='media-info make-cursor']/@onclick").extract_first()
            url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
            request = scrapy.Request(url, callback=self.parse_transcript)
            request.meta['item'] = item
            yield request

    def parse_transcript(self, response):
        item = response.meta['item']
        item["transcript"] = response.xpath("//a[contains(@href,'english')]/@href").extract_first()
        yield item

Any assistance will be appreciated.

Answer 1

So there were a couple of problems with the above code:

Looking at the traceback output, the actual problem turned out to be: NameError: global name 'Request' is not defined. The code called Request without importing it.
The URL in question needed a leading zero for the mm and dd values.
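
The leading-zero point can be checked without Scrapy. The short sketch below is not part of the original answer; it uses only built-in Python. As a side observation, it also shows that range(12, 12), as written in the question's loop, is empty, so that loop body would never run as posted.

```python
# range(start, stop) excludes stop, so range(12, 12) yields nothing
months = list(range(12, 12))
print(months)  # []

# range(12, 13) is what selects December only
december = list(range(12, 13))
print(december)  # [12]

# Plain formatting does not pad single-digit values
unpadded = "{yyyy}-{mm}-{dd}".format(yyyy=2013, mm=6, dd=5)
print(unpadded)  # 2013-6-5

# The 02d format spec pads to two digits, like str(n).zfill(2)
padded = "{yyyy}-{mm:02d}-{dd:02d}".format(yyyy=2013, mm=6, dd=5)
print(padded)  # 2013-06-05
```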

The solution

Included this line at the top of the script: from scrapy.http.request import Request
Rewrote the start_requests loop with itertools and zfill:

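import itertools
from scrapy.http.request import Request

# Method of SuhbaSpider: one request per (year, month, day) combination
def start_requests(self):
    for yyyy, mm, dd in itertools.product(range(2013, 2020), range(6, 7), range(14, 22)):
        # zero-pad month and day so the date reads e.g. 2013-06-14
        mm = str(mm).zfill(2)
        dd = str(dd).zfill(2)
        url = "http://saltanat.org/videos.php?date={0}-{1}-{2}".format(yyyy, mm, dd)
        yield Request(url, meta={'start_url': url}, callback=self.parse)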

* ignore the actual dates, those are for testing *
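
As a variation on the loop above, the URL list can be built and inspected separately from the spider. This is only a sketch: build_urls and BASE_URL are hypothetical names introduced here, not part of the original answer. It relies on datetime.date to reject impossible dates (such as 30 February) and on date.isoformat() to supply the leading zeros.

```python
import itertools
from datetime import date

# Assumed URL pattern, taken from the question
BASE_URL = "http://saltanat.org/videos.php?date={}"


def build_urls(years, months, days):
    """Hypothetical helper: yield one URL per valid calendar date."""
    for yyyy, mm, dd in itertools.product(years, months, days):
        try:
            day = date(yyyy, mm, dd)
        except ValueError:
            # Not a real calendar date (e.g. 30 February), skip it
            continue
        # isoformat() returns YYYY-MM-DD with leading zeros
        yield BASE_URL.format(day.isoformat())


urls = list(build_urls(range(2013, 2020), range(6, 7), range(14, 22)))
print(len(urls))  # 56
print(urls[0])    # http://saltanat.org/videos.php?date=2013-06-14
```

Inside start_requests, each URL from such a helper could then be wrapped in a Request exactly as in the answer's loop.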

0 Accepted 2021-06-14 02:26:37
