scrapy python pass start_urls from spider to pipeline
I want to pass the start_urls from my spider to my MySQLPipeline.
How can I do that?
This is part of my spider.py:
def __init__(self, *args, **kwargs):
    urls = kwargs.pop('urls', [])
    if urls:
        self.start_urls = urls.split(',')
        self.logger.info(self.start_urls)
    url = "".join(urls)
    self.allowed_domains = [url.split('/')[-1]]
    super(SeekerSpider, self).__init__(*args, **kwargs)
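(For context, a standalone sketch of what that argument parsing does, with a hypothetical single-URL value such as one passed on the command line via `scrapy crawl seeker -a urls=http://example.com`; note that `"".join(urls)` on a string is just a copy, so the domain derivation only behaves sensibly for one URL:)

```python
# Hypothetical "urls" spider argument, as passed with -a urls=...
urls = "http://example.com"

start_urls = urls.split(',')            # comma-separated list of URLs
url = "".join(urls)                     # joining a string is a no-op copy
allowed_domains = [url.split('/')[-1]]  # last path segment = bare domain

print(start_urls)        # ['http://example.com']
print(allowed_domains)   # ['example.com']
```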
And this is my pipeline.py:
class MySQLPipeline(object):
    def __init__(self):
        ...
        # get the url from the spiders
        start_url = SeekerSpider.start_urls  # not working
        url = "".join(start_url).split('/')[-1]
        self.tablename = url.split('.')[0]
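(The class lookup `SeekerSpider.start_urls` fails because `start_urls` is set on the *instance* inside `__init__`, not on the class. Scrapy passes the running spider instance into every pipeline hook, so a sketch of the usual fix is to read it from the `spider` argument instead — this is illustrative code, not the asker's:)

```python
class MySQLPipeline(object):
    def process_item(self, item, spider):
        # "spider" is the running spider instance, so spider.start_urls
        # holds the values assigned in SeekerSpider.__init__ -- a class
        # lookup like SeekerSpider.start_urls only sees class-level defaults.
        start_url = "".join(spider.start_urls).split('/')[-1]
        self.tablename = start_url.split('.')[0]
        return item
```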
UPDATE
This is another way I tried, but if I have 100 requests... it will create the table 100 times...
pipeline.py
class MySQLPipeline(object):
    def __init__(self):
        ...

    def process_item(self, item, spider):
        tbl_name = item['tbl_name']
        general_table = """CREATE TABLE IF NOT EXISTS CrawledTables
                           (id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
                            Name VARCHAR(100) NOT NULL,
                            Date VARCHAR(100) NOT NULL,
                            PRIMARY KEY (id), UNIQUE KEY (Name))
                           ENGINE=InnoDB DEFAULT CHARSET=utf8"""
        insert_table = """INSERT INTO CrawledTables (Name, Date) VALUES (%s, %s)"""
        self.cursor.execute(general_table)
        crawled_date = datetime.datetime.now().strftime("%y/%m/%d-%H:%M")
        self.cursor.execute(insert_table, (tbl_name, str(crawled_date)))
        ...
spider.py
def __init__(self, *args, **kwargs):
    urls = kwargs.pop('urls', [])
    if urls:
        self.start_urls = urls.split(',')
        self.logger.info(self.start_urls)
    url = "".join(urls)
    self.allowed_domains = [url.split('/')[-1]]
    super(SeekerSpider, self).__init__(*args, **kwargs)

    self.date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
    self.dmn = "".join(self.allowed_domains).replace(".", "_")
    tablename = urls.split('/')[-1]
    table_name = tablename.split('.')[0]
    newname = table_name[:1].upper() + table_name[1:]
    date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
    self.tbl_name = newname + "_" + date

def parse_page(self, response):
    item = CrawlerItem()
    item['tbl_name'] = self.tbl_name
    ...
In this table I am trying to record the table I'm crawling only once, together with the date... basically I take the start_urls, derive allowed_domains from it, and then derive tbl_name (the MySQL table name) from that.
I found out that I need to create another function in the pipeline:
def open_spider(self, spider):
and this takes the spider with all the attributes you set on it, so you can use them in the pipeline.
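(A sketch of that open_spider approach, reusing the cursor and tbl_name names from above — the cursor setup is assumed to happen elsewhere, and any object with an execute() method stands in for it here. open_spider runs exactly once per crawl, so the CREATE TABLE and the bookkeeping INSERT no longer repeat for every item:)

```python
import datetime

class MySQLPipeline(object):
    def __init__(self, cursor):
        # Assumed: a MySQL cursor prepared elsewhere (e.g. in from_crawler).
        self.cursor = cursor

    def open_spider(self, spider):
        # Called once, when the crawl starts; "spider" is the running
        # SeekerSpider instance, so attributes like tbl_name set in its
        # __init__ are available here.
        self.tbl_name = spider.tbl_name
        self.cursor.execute(
            """CREATE TABLE IF NOT EXISTS CrawledTables
               (id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
                Name VARCHAR(100) NOT NULL,
                Date VARCHAR(100) NOT NULL,
                PRIMARY KEY (id), UNIQUE KEY (Name))
               ENGINE=InnoDB DEFAULT CHARSET=utf8""")
        crawled_date = datetime.datetime.now().strftime("%y/%m/%d-%H:%M")
        self.cursor.execute(
            "INSERT INTO CrawledTables (Name, Date) VALUES (%s, %s)",
            (self.tbl_name, crawled_date))

    def process_item(self, item, spider):
        # Per-item work only -- no table creation here.
        return item
```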