
Loading heroku pg database with scraped data from scrapy spider

I am new to Heroku Postgres. What I have done is write a Scrapy crawler which runs without any errors. The problem is that I want to put all the scraped data into my Heroku Postgres database. To do that, I roughly followed this tutorial.

When I run the crawler on my local machine with scrapy crawl spidername, it runs successfully, but the scraped data is not inserted, nor is any table created in the Heroku database. I don't even get any error on the local terminal. Here is my code...

settings.py

BOT_NAME = 'crawlerconnectdatabase'

SPIDER_MODULES = ['crawlerconnectdatabase.spiders']
NEWSPIDER_MODULE = 'crawlerconnectdatabase.spiders'

DATABASE = {'drivername': 'postgres',
            'host': 'ec2-54-235-250-41.compute-1.amazonaws.com',
            'port': '5432',
            'username': 'dtxwjcycsaweyu',
            'password': '***',
            'database': 'ddcir2p1u2vk07'}
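
As a quick sanity check, these credentials can be tested outside Scrapy entirely. A minimal sketch, run from the project directory, assuming psycopg2 is installed and a SQLAlchemy version that still accepts the 'postgres' drivername (newer releases require 'postgresql'):

from sqlalchemy import create_engine, text
from sqlalchemy.engine.url import URL

import settings

# Build the same URL that models.db_connect() builds and run a trivial query;
# if this fails, the problem is the connection settings, not the pipeline.
engine = create_engine(URL(**settings.DATABASE))
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())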

items.py

from scrapy.item import Item, Field

class CrawlerconnectdatabaseItem(Item):
    name = Field()
    url = Field()
    title = Field()
    link = Field()
    page_title = Field()
    desc_link = Field()
    body = Field()
    news_headline = Field()

models.py

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings

DeclarativeBase = declarative_base()


def db_connect():

    return create_engine(URL(**settings.DATABASE))


def create_deals_table(engine):

    DeclarativeBase.metadata.create_all(engine)


class Deals(DeclarativeBase):
    """Sqlalchemy deals model"""
    __tablename__ = "news_data"

    id = Column(Integer, primary_key=True)
    body = Column('body', String)
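
One thing worth noting: Deals only maps id and body, but the item above declares name, url, title, link, page_title, desc_link, body, and news_headline. Since the pipeline builds rows with Deals(**item), any populated field without a matching column raises a TypeError. A sketch of a model covering every item field (which columns you actually need depends on your spider):

class Deals(DeclarativeBase):
    """Sqlalchemy deals model with one column per item field."""
    __tablename__ = "news_data"

    id = Column(Integer, primary_key=True)
    name = Column('name', String)
    url = Column('url', String)
    title = Column('title', String)
    link = Column('link', String)
    page_title = Column('page_title', String)
    desc_link = Column('desc_link', String)
    body = Column('body', String)
    news_headline = Column('news_headline', String)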

pipelines.py

from sqlalchemy.orm import sessionmaker
from models import Deals, db_connect, create_deals_table

class CrawlerconnectdatabasePipeline(object):
    """Stores scraped items in the news_data table."""

    def __init__(self):
        # Connect and create the table once, when Scrapy instantiates
        # the pipeline at crawler startup.
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # One short-lived session per item: insert, commit, and close,
        # rolling back on any database error.
        session = self.Session()
        deal = Deals(**item)

        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item

spider

The code for the spider can be found here.
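
The linked spider is not reproduced here, but as a purely hypothetical sketch of its likely shape (the spider name, start URL, and selector are all assumptions; only the body field that the Deals model stores is filled):

import scrapy

from crawlerconnectdatabase.items import CrawlerconnectdatabaseItem


class NewsSpider(scrapy.Spider):
    name = "spidername"  # the name used in "scrapy crawl spidername"
    start_urls = ["http://example.com/news"]  # placeholder URL

    def parse(self, response):
        # Gather the page text into the one field the Deals model persists.
        item = CrawlerconnectdatabaseItem()
        item["body"] = " ".join(response.css("p::text").extract())
        yield item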

You need to add ITEM_PIPELINES = {'crawlerconnectdatabase.pipelines.CrawlerconnectdatabasePipeline': 300,} to your settings.py. Without that setting, Scrapy never instantiates the pipeline, so its __init__ (which connects and creates the table) never runs and items are never passed to process_item — which is why nothing happens and no error is raised.
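
For reference, the added setting looks like this in settings.py (300 is just the pipeline's order within ITEM_PIPELINES):

ITEM_PIPELINES = {
    'crawlerconnectdatabase.pipelines.CrawlerconnectdatabasePipeline': 300,
}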

