
Pandas dataframe to Dockerized Postgres using SQLAlchemy

One-line summary: I would like to 1) spin up a Postgres database that runs in Docker and 2) populate that database with a pandas data frame using SQLAlchemy from outside the container.


Docker runs fine:

CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                    NAMES
27add831cce5        postgres:10.1-alpine     "docker-entrypoint.s…"   2 weeks ago         Up 2 weeks          5432/tcp                 django-postgres_db_1

I've been able to find posts on getting a pandas data frame to Postgres, and using SQLAlchemy to create a table in a Dockerized Postgres. Stitching that together I get the following that (sort of) works:

import numpy as np
import pandas as pd

from sqlalchemy import create_engine
from sklearn.datasets import load_iris


def get_iris():

    iris = load_iris()

    return pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                        columns=iris['feature_names'] + ['target'])

df = get_iris()

print(df.head(n=5))

engine = create_engine(
    'postgresql://postgres:mysecretpassword@localhost:5432/postgres')

df.to_sql('iris', engine)

Questions:

q.1) Is the above close to the preferred way of doing this?

q.2) Is there a way to create a db in Postgres using SQLAlchemy? E.g., so I don't have to manually add a new db or populate the default Postgres one.


Problems:

p.1) When I run the code above with the create_engine call that 'works', I get the following error:

  File "/home/tmo/projects/toy-pipeline/venv/lib/python3.5/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 683, in do_executemany
    cursor.executemany(statement, parameters)
KeyError: 'sepal length (cm'

However, if I run the code again, it says that the iris table already exists. If I manually connect to the Postgres db and run postgres=# TABLE iris; it returns nothing.
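A likely cause of the KeyError, offered as a hedged reading of the traceback: psycopg2 passes parameters in the `%(name)s` style, so a column label containing parentheses such as `sepal length (cm)` is cut off at the first `)`, giving the key `sepal length (cm`. The table itself is created before the insert fails, which would explain why it exists but is empty. A minimal sketch of one workaround is to rename the columns before writing; `sanitize_column` is a hypothetical helper, not part of pandas or SQLAlchemy.

```python
import re


def sanitize_column(name):
    """Return a lowercase snake_case version of a column label.

    Hypothetical helper, not part of pandas or SQLAlchemy.
    """
    # Replace every run of characters that is not a letter, digit or
    # underscore (spaces, parentheses, ...) with a single underscore.
    cleaned = re.sub(r'\W+', '_', name.strip().lower())
    # Drop the underscore left behind by a trailing ')'.
    return cleaned.strip('_')


# Column labels as produced by get_iris() above.
iris_columns = ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)', 'target']
safe_columns = [sanitize_column(c) for c in iris_columns]
print(safe_columns)
```

With a data frame in hand this would be applied as `df.columns = [sanitize_column(c) for c in df.columns]` before calling `df.to_sql`. Since the failed run left an empty iris table behind, passing `if_exists='replace'` to `to_sql` avoids the "already exists" error on the next attempt.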

p.2) I have a database called testdb in my Postgres instance running in Docker:

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 testdb    | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
(4 rows)

but if I try to use that database in create_engine I get an error:

conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  database "testdb" does not exist

(notice how postgres has been replaced by testdb):

engine = create_engine(
    'postgresql://postgres:mysecretpassword@localhost:5432/testdb')

Update:

So, I think I've figured out what the problem might be: an incorrect use of hostname and address. I should mention that I am running on an Azure instance, on Ubuntu 16.04.

Here is some useful info on the container that is running Postgres:

HOSTNAME=96402054abb3
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/postgresql/10/bin
PGDATA=/var/lib/postgresql/data
PG_MAJOR=10
PG_VERSION=10.5-1.pgdg90+1

And its /etc/hosts:

127.0.0.1   localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.2  96402054abb3

How do I construct my connection string properly? I've tried:

Container name, as suggested here:

engine = create_engine(
    'postgresql://postgres:saibot@{}:5432/testdb'.format(
    'c101519547f8e89c3422ca9e1dc68d85ad9f24bd8e049efb37273782540646f0'))

OperationalError: (psycopg2.OperationalError) could not translate host name "96402054abb3" to address: Name or service not known

and I've tried putting in the IP, localhost, HOSTNAME etc. with no luck.
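One observation that may be relevant here: in the docker ps listing above, the PORTS column reads 5432/tcp without a host-side mapping such as 0.0.0.0:5432->5432/tcp. That usually means the port is exposed inside Docker but not published to the host. If so, localhost:5432 would not reach this container, and the host part of the URL has to be an address at which the container is actually reachable. To keep the pieces of the connection string apart while experimenting, a small sketch like the following can help; `build_pg_url` is a hypothetical helper, and the credentials are the placeholder values used in this question.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, database, port=5432):
    """Assemble a postgresql:// URL from its parts (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or '/'
    # in a password cannot be mistaken for URL delimiters.
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        quote_plus(user), quote_plus(password), host, port, database)


# 172.17.0.2 is the container address from the /etc/hosts listing above.
url = build_pg_url('postgres', 'mysecretpassword', '172.17.0.2', 'testdb')
print(url)
```

The resulting string can be passed to `create_engine` unchanged, and only the `host` argument needs to vary when trying localhost, the container IP, or a hostname.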

I am using this snippet of code to test if the db connects:

from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists

engine = create_engine(
    'postgresql://postgres:saibot@172.17.0.2/testdb')

database_exists(engine.url)
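The check above mixes two questions: whether the server can be reached at all, and whether the database exists on it. As a rough sketch, a plain TCP probe using only the standard library can separate the first question from the second; `can_reach` is a hypothetical helper and says nothing about credentials or database names.

```python
import socket


def can_reach(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only checks the network path, not the
    credentials or whether a particular database exists.
    """
    try:
        # create_connection resolves the host name and connects; the
        # with-block closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
```

Calling `can_reach('172.17.0.2')` and `can_reach('localhost')` from the machine that runs the Python code shows which host values are worth putting into the connection string before involving SQLAlchemy at all.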

I solved this by putting the IP address of the container, 172.17.0.2, into the connection string:

'postgresql://postgres:mysecretpasswd@172.17.0.2/raw_data'

This, in combination with the following function, solved my problem:

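from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists


def db_create(engine_url, dataframe):
    """
    Create the Postgres database if it does not exist, then write the
    data frame to a table named after the database.
    """

    engine = create_engine(engine_url)

    if not database_exists(engine.url):
        print("Database does not exist, creating...")
        create_database(engine.url)

    print("Does it exist now?", database_exists(engine.url))

    if database_exists(engine.url):
        # The database name (the last URL segment) doubles as the table name.
        data_type = str(engine.url).rsplit('/', 1)[1]
        print('Populating database with', data_type)
        dataframe.to_sql(data_type, engine)


# df is the pandas data frame to load.
db_create('postgresql://postgres:mysecretpasswd@172.17.0.2/raw_data', df)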

This creates a database called raw_data containing a table called raw_data, and populates the table with the target pandas data frame.
