
scrapy python pass start_urls from spider to pipeline

I want to pass start_urls from my spider to my MySQL pipeline.

How can I do that?

Here is part of my spider.py:

def __init__(self, *args, **kwargs):
    urls = kwargs.pop('urls', [])
    if urls:
        self.start_urls = urls.split(',')
    self.logger.info(self.start_urls)
    url = "".join(urls)
    self.allowed_domains = [url.split('/')[-1]]
    super(SeekerSpider, self).__init__(*args, **kwargs)
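The `urls` keyword argument above would typically come from the command line (e.g. `scrapy crawl seeker -a urls="http://example.com"`, spider name assumed). Its handling can be exercised outside Scrapy as a plain function (a sketch of the same logic, not part of the original spider):

```python
# Sketch of the urls-kwarg handling from __init__ above, outside Scrapy.
def parse_urls_kwarg(urls):
    # a comma-separated string becomes the start_urls list
    start_urls = urls.split(',') if urls else []
    # joining and taking the last path segment yields the domain
    joined = "".join(urls)
    allowed_domains = [joined.split('/')[-1]]
    return start_urls, allowed_domains
```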

And here is my pipeline.py:

class MySQLPipeline(object):
    def __init__(self):

        ...

        #  get the url from the spiders
        start_url = SeekerSpider.start_urls  # not working    

        url = "".join(start_url).split('/')[-1]
        self.tablename = url.split('.')[0]

UPDATE

Here is another approach I tried, but if I have 100 requests... it creates the table 100 times...

pipeline.py

class MySQLPipeline(object):
    def __init__(self):
        ...

    def process_item(self, item, spider):
        tbl_name = item['tbl_name']
        general_table = """ CREATE TABLE IF NOT EXISTS CrawledTables
                            (id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
                            Name VARCHAR(100) NOT NULL,
                            Date VARCHAR(100) NOT NULL,
                            PRIMARY KEY (id), UNIQUE KEY (NAME))
                            ENGINE=Innodb DEFAULT CHARSET=utf8 """

        insert_table = """ INSERT INTO CrawledTables (Name,Date) VALUES(%s,%s)"""

        self.cursor.execute(general_table)
        crawled_date = datetime.datetime.now().strftime("%y/%m/%d-%H:%M")
        self.cursor.execute(insert_table, (tbl_name,
                                           str(crawled_date)))

        ...
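One way to avoid re-running the table setup for every item is to guard it with a flag so it only fires on the first item (a sketch of the idea only; the real `cursor.execute(general_table)` call is replaced by a counter here so the logic is visible):

```python
# Sketch: guard the one-time table setup so it runs for the first item only.
class MySQLPipeline(object):
    def __init__(self):
        self._table_ready = False
        self.create_calls = 0  # stands in for cursor.execute(general_table)

    def process_item(self, item, spider):
        if not self._table_ready:
            # the CREATE TABLE IF NOT EXISTS ... would run here, once
            self.create_calls += 1
            self._table_ready = True
        # the per-item INSERT would go here
        return item
```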

spider.py

def __init__(self, *args, **kwargs):
    urls = kwargs.pop('urls', [])
    if urls:
        self.start_urls = urls.split(',')
    self.logger.info(self.start_urls)
    url = "".join(urls)
    self.allowed_domains = [url.split('/')[-1]]
    super(SeekerSpider, self).__init__(*args, **kwargs)

    self.date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
    self.dmn = "".join(self.allowed_domains).replace(".", "_")

    tablename = urls.split('/')[-1]
    table_name = tablename.split('.')[0]
    newname = table_name[:1].upper() + table_name[1:]
    date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
    self.tbl_name = newname + "_" + date

def parse_page(self, response):

    item = CrawlerItem()
    item['tbl_name'] = self.tbl_name

    ...

In this table I am trying to record the crawl date only once per crawl... Basically, I take the start_urls, derive allowed_domains from it, and then derive tbl_name (used as the MySQL table name) from that.

I found that I need to create another function in the pipeline:

def open_spider(self, spider):

It receives the spider instance, so all of the spider's attributes can be used in the pipeline.
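A minimal sketch of that approach (assuming the spider sets `self.start_urls` in `__init__`, as above): Scrapy calls `open_spider` once when the spider starts and passes the spider instance in, so the pipeline can read its attributes there instead of touching the spider class directly.

```python
# Sketch: read the spider's start_urls inside open_spider, which Scrapy
# calls exactly once per spider run with the spider instance.
class MySQLPipeline(object):
    def open_spider(self, spider):
        url = "".join(spider.start_urls)
        domain = url.split('/')[-1]            # e.g. "example.com"
        self.tablename = domain.split('.')[0]  # e.g. "example"
        # one-time setup (DB connection, CREATE TABLE) would also go here
```

Because this runs once, it also solves the repeated-CREATE-TABLE problem from the update above.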
