
Python scrapy persistent database connection for MySQL

I am using scrapy for one of my projects. The data gets scraped by the spider and passed to a pipeline for insertion into the database. Here is my database class code:

import MySQLdb


class Database:

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self):
        self.connection = MySQLdb.connect(
            self.host, self.user, self.password, self.db,
            use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()


    def __del__(self):
        self.connection.close()

Here is my pipeline code that processes the scraped items and saves them into the MySQL database.

from con import Database 

class LinkPipeline(object):

    def __init__(self):
        self.db = Database()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date,
                   status, company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'],
                  item['posted_date'], item['status'], item['company_id'],
                  item['scraped_link'], item['content'], item['detail_link'],
                  item['job_id'])
        self.db.insert(query, params)
        return item

From the above flow, my impression is that whenever an item is processed by the pipeline, a database connection is opened and then closed when process_item completes. This would open too many database connections. I want the database connection to be opened only once for the whole life cycle of the spider and closed when the spider is closed.

I have read about the open_spider and close_spider methods; if I use them, how can I pass the reference to the database connection from the spider's start_requests method to the pipeline class?

Are there any better approaches to go about it?
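
Scrapy item pipelines expose open_spider and close_spider hooks of their own, so the pipeline itself can own a single connection for the whole crawl and nothing has to be passed from the spider. Here is a minimal sketch of that approach; the class name is illustrative, and the table, columns and credentials are the placeholders already used above:

import MySQLdb


class MySQLStorePipeline(object):
    """Opens one MySQL connection per spider run and reuses it for every item."""

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.connection = MySQLdb.connect(
            'localhost', 'root', 'test123', 'scraping_db',
            use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.connection.close()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date,
                   status, company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'],
                  item['posted_date'], item['status'], item['company_id'],
                  item['scraped_link'], item['content'], item['detail_link'],
                  item['job_id'])
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except MySQLdb.Error:
            self.connection.rollback()
        return item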

One answer suggests opening the connection in the spider itself, so that it is created once for the whole run:

import MySQLdb
import scrapy


class MySpider(scrapy.Spider):
    name = "myspidername"

    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # One connection for the lifetime of the spider instance.
        self.connection = MySQLdb.connect(
            self.host, self.user, self.password, self.db,
            use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()

    def __del__(self):
        self.connection.close()

Then, in your pipeline, access the connection through spider.cursor (or call the spider's insert helper, as below) to perform any MySQL operation:

class LinkPipeline(object):

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date,
                   status, company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'],
                  item['posted_date'], item['status'], item['company_id'],
                  item['scraped_link'], item['content'], item['detail_link'],
                  item['job_id'])
        spider.insert(query, params)
        return item
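
Whichever class ends up holding the connection, the pipeline only runs if it is enabled in the project's settings.py; the module path below is an assumption about the project layout:

ITEM_PIPELINES = {
    'myproject.pipelines.LinkPipeline': 300,
}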
