Scrapy - How to keep track of start url
Given a pool of start URLs, I would like to identify in the parse_item() function which start URL a given response originated from.
As far as I can tell, Scrapy spiders start crawling from the initial pool of start URLs, but when parsing there is no trace of which of those URLs was the initial one. How is it possible to keep track of the starting point?
If you need the URL currently being parsed inside the spider, just use response.url:
def parse_item(self, response):
    print(response.url)
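Note that response.url only reports the URL of the current response, so once the spider follows links to other pages, the original start URL is no longer visible there. A common Scrapy pattern (not shown in the original answer, so treat this as a hedged sketch) is to tag each outgoing request with its origin, e.g. via request.meta['start_url'], and copy that value onto every follow-up request. The propagation idea, illustrated with plain dictionaries so it runs without Scrapy installed:

```python
# Sketch of meta propagation: each "request" carries a meta dict, and
# follow-up requests inherit the start_url recorded by the first request
# in the chain. make_request/follow are hypothetical stand-ins for
# scrapy.Request; they are not Scrapy APIs.

def make_request(url, meta=None):
    return {'url': url, 'meta': dict(meta or {})}

def start_requests(start_urls):
    # Tag every initial request with its own URL as the origin.
    return [make_request(u, {'start_url': u}) for u in start_urls]

def follow(current_request, next_url):
    # A followed link inherits the origin from the current request's meta.
    return make_request(next_url, current_request['meta'])

chain = start_requests(['http://example.com/a', 'http://example.com/b'])
deep = follow(follow(chain[0], 'http://example.com/a/1'),
              'http://example.com/a/1/2')
print(deep['meta']['start_url'])  # the origin survives both hops
```

In real Scrapy code the same thing is done by yielding scrapy.Request(url, meta={'start_url': response.meta['start_url']}) from each callback, so parse_item can read response.meta['start_url'].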
But in case you need it outside the spider, I can think of the following approach: pass the URLs in as a spider argument. In scrapycaller.py:
from subprocess import call
urls = 'url1,url2'
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)
call(cmd, shell=True)
Inside myspider:
import scrapy

class mySpider(scrapy.Spider):
    name = 'myspider'  # matches the "scrapy crawl myspider" command above

    def __init__(self, myurls='', **kwargs):
        super().__init__(**kwargs)
        self.start_urls = myurls.split(",") if myurls else []
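The -a flag hands myurls to __init__ as a plain string, so the comma-joined value built in scrapycaller.py round-trips back into a list inside the spider. That string handling can be checked without running Scrapy at all:

```python
# Mirror of the argument round-trip above, runnable without Scrapy.
urls = 'url1,url2'                                        # built in scrapycaller.py
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)   # shell command

start_urls = urls.split(",")                              # what __init__ does
print(cmd)
print(start_urls)
```

One caveat with this scheme: commas cannot appear inside the individual URLs themselves, since the comma is used as the separator.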