In Scrapy, I am trying to retrieve data from a database. The data was scraped by one spider and written to the database in pipelines.py, and I want to use it in another spider. Specifically, I want to read the links from the database and use them in the start_requests function. I know this problem is also covered in "Scrapy: Get Start_Urls from Database by Pipeline", and I tried to follow that example, but it does not work and I cannot see where I made the mistake.
pipelines.py
import sqlite3


class HeurekaScraperPipeline:
    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect('shops.db')
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS shops_tb""")
        self.curr.execute("""create table shops_tb(
            product_name text,
            shop_name text,
            price text,
            link text
        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        self.curr.execute("""insert into shops_tb values (?, ?, ?, ?)""", (
            item['product_name'],
            item['shop_name'],
            item['price'],
            item['link'],
        ))
        self.conn.commit()
spider
class Shops_spider(scrapy.Spider):
    name = 'shops_scraper'
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def start_requests(self):
        db_cursor = HeurekaScraperPipeline().curr
        db_cursor.execute("SELECT * FROM shops_tb")
        links = db_cursor.fetchall()
        for link in links:
            url = link[3]
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.request.url
        print('********************************'+url+'************************')
Thanks in advance for your help.
Pipelines are for processing items. If you want to read something from the database, open the connection and read it in start_requests. As per the documentation:
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.
Why not open the DB connection in start_requests?
def start_requests(self):
    self.conn = sqlite3.connect('shops.db')
    self.curr = self.conn.cursor()
    self.curr.execute("SELECT * FROM shops_tb")
    links = self.curr.fetchall()
    # rest of the code
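One likely reason the original attempt returns nothing can be read from the question's own code: the pipeline's `__init__` calls `create_table()`, which runs `DROP TABLE IF EXISTS shops_tb`, so creating a `HeurekaScraperPipeline()` inside the spider empties the table before the `SELECT` runs. A minimal sketch of reading the links without touching the pipeline might look like the following. The helper name `read_links` is hypothetical (it is not part of Scrapy or of the question), and the sketch assumes the `shops_tb` schema shown in the question.

```python
import sqlite3
from contextlib import closing


def read_links(db_path="shops.db"):
    """Return the non-empty values of the `link` column of shops_tb.

    Hypothetical helper for illustration; assumes the table layout
    from the question (product_name, shop_name, price, link).
    """
    # Plain connection: unlike the pipeline's __init__, nothing here
    # drops or recreates the table. closing() ensures the connection
    # is closed even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT link FROM shops_tb "
            "WHERE link IS NOT NULL AND link != ''"
        ).fetchall()
    # Each row is a 1-tuple, so unpack the single column.
    return [row[0] for row in rows]
```

Inside `start_requests`, the spider could then loop with `for url in read_links():` and `yield scrapy.Request(url=url, callback=self.parse)`. Selecting only the `link` column avoids relying on the positional index `link[3]`.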