Python Scrapy: persistent database connection for MySQL

Question

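I am using Scrapy for one of my projects. The spider scrapes the data and passes it to a pipeline, which inserts it into the database. Here is the code of my database class: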

import logging

import MySQLdb


class Database:

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self):
        # One connection and one cursor are created per Database instance.
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            # Log the failure instead of swallowing it silently, then undo it.
            logging.exception("Insert failed; rolling back")
            self.connection.rollback()

    def __del__(self):
        self.connection.close()

Here is my pipeline code, which processes the scraped items and saves them to the MySQL database.

from con import Database


class LinkPipeline(object):

    def __init__(self):
        self.db = Database()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status, company_id, scraped_link, content, detail_link, job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'], item['job_id'])
        self.db.insert(query, params)
        return item

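From the flow above, I get the impression that a database connection is opened and closed every time an item passes through the pipeline, that is, whenever process_item completes. That would open too many database connections. I want the connection to be opened only once for the whole life cycle of the spider and closed when the spider closes.

I have read about the open_spider and close_spider methods in the Spider class. If I use them, how can I pass a reference to the database connection from the spider's start_requests method to the pipeline class?

Is there a better way to solve this?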

Answer 1

import MySQLdb
import scrapy


class MySpider(scrapy.Spider):
    name = "myspidername"

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding the connection.
        super().__init__(*args, **kwargs)
        # Opened once, when the spider object is created.
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()

    def __del__(self):
        self.connection.close()

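Then, in the pipeline, reach the connection through the spider argument (spider.cursor for the raw cursor, or the spider's insert method) to run any MySQL operation.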

class LinkPipeline(object):

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status, company_id, scraped_link, content, detail_link, job_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'], item['status'], item['company_id'], item['scraped_link'], item['content'], item['detail_link'], item['job_id'])
        # insert() is defined on the spider, not on the cursor; it also commits or rolls back.
        spider.insert(query, params)
        return item
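
As a point of comparison, open_spider and close_spider are hooks that Scrapy calls on item pipelines, so the connection can also live in the pipeline itself instead of on the spider. The following is only a minimal sketch under some assumptions: the class name PersistentLinkPipeline and the connect argument are hypothetical (the argument exists only so the class can be exercised without a MySQL server), the credentials are the ones from the question, and the hook signatures follow the process_item(self, item, spider) form used above.

```python
import logging

logger = logging.getLogger(__name__)

# Column list taken from the INSERT statement in the question.
COLUMNS = (
    "title", "location", "company_name", "posted_date", "status",
    "company_id", "scraped_link", "content", "detail_link", "job_id",
)
INSERT_SQL = "INSERT INTO links ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
)


class PersistentLinkPipeline:
    """Hypothetical pipeline that keeps one connection open per crawl."""

    def __init__(self, connect=None):
        # `connect` is an assumed injection point for trying the class out;
        # Scrapy itself creates the pipeline without arguments.
        self._connect = connect
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        # Called once, when the spider is opened.
        connect = self._connect
        if connect is None:
            import MySQLdb  # imported lazily; requires the MySQL driver
            connect = MySQLdb.connect
        self.connection = connect(
            host="localhost", user="root", passwd="test123",
            db="scraping_db", use_unicode=True, charset="utf8",
        )
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # Reuses the cursor opened in open_spider for every item.
        params = tuple(item[column] for column in COLUMNS)
        try:
            self.cursor.execute(INSERT_SQL, params)
            self.connection.commit()
        except Exception:
            logger.exception("Insert failed; rolling back")
            self.connection.rollback()
        return item

    def close_spider(self, spider):
        # Called once, when the spider is closed.
        self.cursor.close()
        self.connection.close()
```

To try it, the class would be listed in the project's ITEM_PIPELINES setting in place of LinkPipeline. Scrapy creates one pipeline object per crawl, so with this layout connect runs once no matter how many items pass through process_item.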


Answer 1 (2, accepted, 2018-04-10 11:45:45)
