
How to pass a looped list of urls to Scrapy (url="")

I have a loop that creates the links I want to scrape:

    from datetime import date, timedelta

    start_date = date(2020, 1, 1)
    end_date = date.today()
    crawl_date = start_date
    base_url = "https://www.racingpost.com/results/"
    links = []
    # Generate one link per day from start_date up to today
    while crawl_date <= end_date:
        links.append(base_url + str(crawl_date))
        crawl_date += timedelta(days=1)

If I print "links", it works fine and I get the urls I want.
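For reference, a minimal sketch of what the generated list contains (the exact length depends on today's date; str() on a datetime.date gives ISO YYYY-MM-DD):

    # Print the first few generated links
    for link in links[:3]:
        print(link)
    # https://www.racingpost.com/results/2020-01-01
    # https://www.racingpost.com/results/2020-01-02
    # https://www.racingpost.com/results/2020-01-03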

Then I have a spider that scrapes the site just as well if I put in the url manually. Now I tried to pass the "links" variable containing the urls I want to scrape, as below, but I get "undefined variable" back.

    import scrapy
    from scrapy_splash import SplashRequest

    class RpresultSpider(scrapy.Spider):
        name = 'rpresult'
        allowed_domains = ['www.racingpost.com']

        script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
        '''

        def start_requests(self):
            yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                                args={
                                    'lua_source': self.script
                                })

        def parse(self, response):
            for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
                yield {
                    'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
                }
                    

How do I pass the generated links into SplashRequest(url=links)?

Thanks so much for helping me out - I am still new to this and making small steps - most of them backward...

From my comment above (I'm not quite sure if this works because I'm unfamiliar with Scrapy): the obvious problem is that there is no reference to the links variable inside the RpresultSpider class. Putting the loop that generates the urls inside the start_requests function would fix that.

    import scrapy
    from scrapy_splash import SplashRequest
    from datetime import date, timedelta

    class RpresultSpider(scrapy.Spider):
        name = 'rpresult'
        allowed_domains = ['www.racingpost.com']

        script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
        '''

        def start_requests(self):
            start_date = date(2020, 1, 1)
            end_date = date.today()
            crawl_date = start_date
            base_url = "https://www.racingpost.com/results/"
            links = []
            # Generate the links
            while crawl_date <= end_date:
                links.append(base_url + str(crawl_date))
                crawl_date += timedelta(days=1)
            # url expects a single URL string, so yield one request per link
            for link in links:
                yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                    args={
                                        'lua_source': self.script
                                    })

        def parse(self, response):
            for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
                yield {
                    'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
                }
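Note that the spider above assumes the project is already wired up for scrapy-splash, with a Splash instance running (for example the official Docker image on port 8050 - that address is an assumption here). A minimal settings.py sketch, following the scrapy-splash README:

    # settings.py - scrapy-splash setup (SPLASH_URL is an assumption for a local Docker container)
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

With that in place, the spider can be run as usual, e.g. scrapy crawl rpresult -o results.json.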
