
Scrapy - How to keep track of start url

Given a pool of start URLs, I would like to identify the origin URL in the parse_item() function.

As far as I can tell, Scrapy spiders start crawling from the initial pool of start URLs, but when parsing there is no trace of which of those URLs was the original one. How would it be possible to keep track of the starting point?
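One common way to do this in Scrapy is to attach the start URL to each request's meta dict and read it back in the callback; Scrapy copies request.meta onto the response it produces. The sketch below is framework-free so it can run anywhere: the Request and Response classes here only mimic that meta-copying behaviour and are not Scrapy's real classes.

```python
# Framework-free sketch of the request.meta idea: tag each request with
# its start URL so any later callback can recover the origin.
# Request/Response here are stand-ins, not scrapy's actual classes.

class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # scrapy copies request.meta onto the response

def crawl(start_urls):
    origins = []
    for url in start_urls:
        # carry the origin through meta, as you would with scrapy.Request(url, meta=...)
        req = Request(url, meta={"start_url": url})
        resp = Response(req)
        origins.append((resp.url, resp.meta["start_url"]))
    return origins

print(crawl(["http://a.example", "http://b.example"]))
```

In a real spider you would yield `scrapy.Request(url, meta={'start_url': url})` from `start_requests()` and read `response.meta['start_url']` inside `parse_item()`.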

If you need the URL being parsed inside the spider, just use response.url:

def parse_item(self, response):
    print(response.url)

but in case you need it outside the spider, I can think of the following ways:

  1. Use the Scrapy core API
  2. You can also call Scrapy from an external Python module with an OS command (which apparently is not recommended):

In scrapycaller.py:

from subprocess import call
urls = 'url1,url2'
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)
call(cmd, shell=True)

Inside myspider:

class mySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, myurls='', *args, **kwargs):
        super(mySpider, self).__init__(*args, **kwargs)
        # -a myurls=url1,url2 arrives as a single string; split it back into a list
        self.start_urls = myurls.split(",")
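The round trip between scrapycaller.py and the spider can be sanity-checked without running Scrapy at all: the -a argument is one comma-separated string, and __init__ simply splits it back. A minimal sketch (URLs are illustrative):

```python
# Round-trip check: join URLs into the -a argument string the way
# scrapycaller.py does, then split them back the way mySpider.__init__ does.
urls = ["http://a.example/page", "http://b.example/page"]
arg = ",".join(urls)  # the value passed as -a myurls=...
cmd = "scrapy crawl myspider -a myurls={}".format(arg)
recovered = arg.split(",")  # what the spider's __init__ recovers
print(recovered == urls)
```

Note that this scheme breaks if a URL itself contains a comma; a different separator or a JSON-encoded argument would be more robust in that case.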
