
Retrieving data from a database in Scrapy

In Scrapy I am trying to retrieve data from a database that was scraped with one spider and stored by pipelines.py. I want to use this data in another spider; specifically, I want to retrieve links from the database and use them in the start_requests function. I know this problem is also explained in Scrapy: Get Start_Urls from Database by Pipeline, and I tried to follow that example, but unfortunately it is not working and I don't know why. I know I made a mistake somewhere.

pipelines.py
import sqlite3

class HeurekaScraperPipeline:

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect('shops.db')
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS shops_tb""")
        self.curr.execute("""create table shops_tb(
                        product_name text, 
                        shop_name text, 
                        price text, 
                        link text
                        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        self.curr.execute("""insert into shops_tb values (?, ?, ?, ?)""",(
            item['product_name'],
            item['shop_name'],
            item['price'],
            item['link'],
        ))

        self.conn.commit()
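For completeness: a pipeline like this only runs if it is enabled in the project settings. A minimal sketch of that setting, assuming the project module is named heureka_scraper (adjust the dotted path to your actual project layout):

```python
# settings.py -- register the pipeline; the module path 'heureka_scraper'
# is an assumption, the class name follows the code above
ITEM_PIPELINES = {
    'heureka_scraper.pipelines.HeurekaScraperPipeline': 300,
}
```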
spider
class Shops_spider(scrapy.Spider):
    name = 'shops_scraper'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    def start_requests(self):
        db_cursor = HeurekaScraperPipeline().curr
        db_cursor.execute("SELECT * FROM shops_tb")

        links = db_cursor.fetchall()
        for link in links:
            url = link[3]
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        url = response.request.url
        print('********************************'+url+'************************')
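Part of why fetchall() comes back empty here can be reproduced without Scrapy at all: every pipeline instance drops and recreates shops_tb in __init__, so the instance created inside start_requests wipes the rows the crawl stored. A minimal sketch, where Pipeline is a stripped-down stand-in for HeurekaScraperPipeline:

```python
import os
import sqlite3
import tempfile

# Use a throwaway path so this demo doesn't touch a real shops.db
db_path = os.path.join(tempfile.mkdtemp(), 'shops.db')

class Pipeline:
    def __init__(self):
        # Same behaviour as HeurekaScraperPipeline.__init__:
        # connect, then drop and recreate the table
        self.conn = sqlite3.connect(db_path)
        self.curr = self.conn.cursor()
        self.curr.execute("DROP TABLE IF EXISTS shops_tb")
        self.curr.execute("""create table shops_tb(
                        product_name text, shop_name text,
                        price text, link text)""")

first = Pipeline()
first.curr.execute("insert into shops_tb values (?, ?, ?, ?)",
                   ('phone', 'shop', '99', 'http://example.com'))
first.conn.commit()

second = Pipeline()            # what HeurekaScraperPipeline().curr does
second.curr.execute("SELECT * FROM shops_tb")
rows = second.curr.fetchall()
print(len(rows))               # 0: the data was dropped before the SELECT
```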

Thanks in advance for any help.

Pipelines are for processing the items. If you want to read something from the database, open the connection and read it in start_requests. As per the documentation:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

Why not open the DB connection in start_requests? Note also that instantiating HeurekaScraperPipeline() in the spider re-runs create_table, which drops shops_tb, so the subsequent SELECT finds no rows.

def start_requests(self):
    self.conn = sqlite3.connect('shops.db')
    self.curr = self.conn.cursor()
    self.curr.execute("SELECT * FROM shops_tb")
    links = self.curr.fetchall()
    # rest of the code
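The read can also be factored into a small standalone helper that the spider calls from start_requests. A sketch under the question's schema; get_start_urls is a hypothetical name, and closing the connection after reading is an addition:

```python
import sqlite3

def get_start_urls(db_path='shops.db'):
    """Read stored links back out of shops_tb (column order as created
    by the pipeline: product_name, shop_name, price, link)."""
    conn = sqlite3.connect(db_path)
    curr = conn.cursor()
    curr.execute("SELECT link FROM shops_tb")
    rows = curr.fetchall()
    conn.close()
    return [row[0] for row in rows]
```

In the spider, start_requests then becomes `for url in get_start_urls(): yield scrapy.Request(url, callback=self.parse)`.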
