
Scrapy crawler: Unable to store multiple urls into postgres

I have created a crawler using Scrapy (Python). I want to store the multiple URLs fetched by the crawler into a Postgres table. When I start the crawler, the URLs are fetched and the table is created in Postgres, but the data is not stored.

Technology used: Scrapy, Python

Expected output: The URLs should be stored in the Postgres table.

Error: I am unable to store all the URLs. The crawler does not work for all websites.

Please help!

import scrapy
import os
import psycopg2

conn = psycopg2.connect(
    database="postgres",
    user='postgres',
    password='password',
    host='127.0.0.1',
    port='5432'
)
print("connected")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS tmp_crawler (
        WEBSITE VARCHAR(500) NOT NULL
    )
""")


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com//']

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            var1 = "INSERT INTO tmp_crawler(website) VALUES('" + url + "')"
            cur.execute(var1)
            conn.commit()
            yield {'title': abs_url}

You can use Scrapy's ITEM_PIPELINES setting to achieve this. See the sample implementation below.

import scrapy
import psycopg2

class DBPipeline(object):
    def open_spider(self, spider):
        # connect to database
        try:
            self.conn = psycopg2.connect(database = "postgres", user = "postgres", password = "password", host = "127.0.0.1", port = "5432")
            self.conn.autocommit = True
            self.cur = self.conn.cursor()
        except Exception as e:
            spider.logger.error(f"Unable to connect to database: {e}")

        # create the table
        try:
            self.cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL);")
        except Exception as e:
            spider.logger.error(f"Error creating table `tmp_crawler`: {e}")

    def process_item(self, item, spider):
        try:
            self.cur.execute('INSERT INTO tmp_crawler (website) VALUES (%s)', (item.get('title'),))
            spider.logger.info("Item inserted to database")
        except Exception as e:
            spider.logger.error(f"Error `{e}` while inserting item <{item.get('title')}")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains=['google.com']
    start_urls = ['https://www.google.com/'] 
    custom_settings = {
        'ITEM_PIPELINES': {
            DBPipeline: 500
        }
    }

    def parse(self, response):
        urls = response.xpath("//a/@href").getall()
        for url in urls:
            yield {'title': response.urljoin(url)}
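
Note that ITEM_PIPELINES above maps the pipeline class object to its priority; in a regular Scrapy project you would normally enable the pipeline in settings.py by its import path string instead, e.g. 'myproject.pipelines.DBPipeline': 500. If you keep the pipeline and spider in a single script as in the question, here is a minimal sketch of how it could be run without the scrapy crawl command (assuming both classes above are defined in the same file):

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    # run the spider from a plain Python script instead of `scrapy crawl`
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes

Once the crawl finishes, you can verify the inserted rows directly in Postgres, for example with SELECT count(*) FROM tmp_crawler;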
