
Bulk Insert A Pandas DataFrame Using SQLAlchemy

I have some rather large pandas DataFrames and I would like to use the new bulk SQL mappings to upload them to a Microsoft SQL Server via SQL Alchemy. The pandas.to_sql method, while nice, is slow.

I am having trouble writing the code...

I would like to be able to pass this function a pandas DataFrame which I am calling table, a schema name I am calling schema, and a table name I am calling name. Ideally, the function will 1.) delete the table if it already exists, 2.) create a new table, 3.) create a mapper, and 4.) bulk insert using the mapper and the pandas data. I am stuck on part 3.

Here is my (admittedly rough) code. I am struggling with how to get the mapper function to work with my primary keys. I do not really need primary keys, but the mapper function requires them.

Thanks for your insights.

from sqlalchemy import create_engine, Table, Column, MetaData
from sqlalchemy.orm import mapper, create_session
from sqlalchemy.ext.declarative import declarative_base
from pandas.io.sql import SQLTable, SQLDatabase

def bulk_upload(table, schema, name):
    e = create_engine('mssql+pyodbc://MYDB')
    s = create_session(bind=e)
    m = MetaData(bind=e,reflect=True,schema=schema)
    Base = declarative_base(bind=e,metadata=m)
    t = Table(name,m)
    m.remove(t)
    t.drop(checkfirst=True)
    sqld = SQLDatabase(e, schema=schema,meta=m)
    sqlt = SQLTable(name, sqld, table).table
    sqlt.metadata = m
    m.create_all(bind=e,tables=[sqlt])    
    class MyClass(Base):
        pass
    mapper(MyClass, sqlt)    

    s.bulk_insert_mappings(MyClass, table.to_dict(orient='records'))
    return

I ran into a similar problem, with pd.to_sql taking hours to upload data. The code below bulk inserted the same data in a few seconds.

from sqlalchemy import create_engine
import psycopg2 as pg
#load python script that batch loads pandas df to sql
import cStringIO

address = 'postgresql://<username>:<pswd>@<host>:<port>/<database>'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()

#df is the dataframe containing an index and the columns "Event" and "Day"
#create Index column to use as primary key
df.reset_index(inplace=True)
df.rename(columns={'index':'Index'}, inplace =True)

#create the table but first drop if it already exists
command = '''DROP TABLE IF EXISTS localytics_app2;
CREATE TABLE localytics_app2
(
"Index" serial primary key,
"Event" text,
"Day" timestamp without time zone
);'''
cursor.execute(command)
connection.commit()

#stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
output = cStringIO.StringIO()
#ignore the index
df.to_csv(output, sep='\t', header=False, index=False)
#jump to start of stream
output.seek(0)
contents = output.getvalue()
cur = connection.cursor()
#null values become ''
cur.copy_from(output, 'localytics_app2', null="")    
connection.commit()
cur.close()

This question may have been answered by now, but I found my solution by collating different answers on this site and aligning them with SQLAlchemy's documentation.

  1. The table needs to already exist in db1, with an index set up with auto_increment.
  2. The class Current needs to align with the dataframe imported from the CSV and with the table in db1 (a sketch of a matching table is shown right after this list).
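
A minimal sketch (not part of the original answer) of pre-creating such a table in MySQL so it lines up with the Current class defined below; the connection string, exact column types, and DECIMAL precision are assumptions:

# Hypothetical DDL sketch: pre-create tableName in db1 with an auto_increment
# primary key, matching the columns of the Current class used further down.
from sqlalchemy import create_engine, text

setup_engine = create_engine('mysql://root:password@localhost/db1')  # placeholder credentials
ddl = """
CREATE TABLE IF NOT EXISTS tableName (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `Date` VARCHAR(500),
    `Type` VARCHAR(500),
    `Value` DECIMAL(20, 6)
)
"""
with setup_engine.begin() as conn:
    conn.execute(text(ddl))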

Hopefully this helps anyone who comes here and wants to mix Pandas and SQLAlchemy quickly.

from urllib import quote_plus as urlquote
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric
from sqlalchemy.orm import sessionmaker
import pandas as pd


# Set up of the engine to connect to the database
# the urlquote is used for passing the password which might contain special characters such as "/"
engine = create_engine('mysql://root:%s@localhost/db1' % urlquote('weirdPassword*withsp€cialcharacters'), echo=False)
conn = engine.connect()
Base = declarative_base()

#Declaration of the class in order to write into the database. This structure is standard and should align with SQLAlchemy's doc.
class Current(Base):
    __tablename__ = 'tableName'

    id = Column(Integer, primary_key=True)
    Date = Column(String(500))
    Type = Column(String(500))
    Value = Column(Numeric())

    def __repr__(self):
        return "(id='%s', Date='%s', Type='%s', Value='%s')" % (self.id, self.Date, self.Type, self.Value)

# Set up of the table in db and the file to import
fileToRead = 'file.csv'
tableToWriteTo = 'tableName'

# Panda to create a lovely dataframe
df_to_be_written = pd.read_csv(fileToRead)
# The orient='records' is the key here: it aligns with the format the docs describe for bulk inserts.
listToWrite = df_to_be_written.to_dict(orient='records')

metadata = sqlalchemy.schema.MetaData(bind=engine,reflect=True)
table = sqlalchemy.Table(tableToWriteTo, metadata, autoload=True)

# Open the session
Session = sessionmaker(bind=engine)
session = Session()

# Insert the dataframe into the database in one bulk operation
conn.execute(table.insert(), listToWrite)

# Commit the changes
session.commit()

# Close the session
session.close()

Based on @ansonw's answer:

import cStringIO  # Python 2; on Python 3 use io.StringIO instead

def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
    # Create Table (the empty slice writes only the schema)
    df[:0].to_sql(table, engine, if_exists=if_exists)

    # Prepare data
    output = cStringIO.StringIO()
    df.to_csv(output, sep=sep, header=False, encoding=encoding)
    output.seek(0)

    # Insert data
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table, sep=sep, null='')
    connection.commit()
    cursor.close()

I insert 200,000 rows in 5 seconds instead of 4 minutes.

Pandas 0.25.1 has a parameter to do multi-inserts, so it is no longer necessary to work around this issue with SQLAlchemy.

Set method='multi' when calling pandas.DataFrame.to_sql.

In this example it would be df.to_sql(table, schema=schema, con=e, index=False, if_exists='replace', method='multi')

Answer sourced from the documentation here.

Of note, I have only tested this with Redshift. Please let me know how it goes on other databases so I can update this answer.

Since this is an I/O-heavy workload, you can also use the Python threading module via multiprocessing.dummy. This sped things up for me:

import math
from multiprocessing.dummy import Pool as ThreadPool

...

def insert_df(df, *args, **kwargs):
    nworkers = 4

    chunksize = math.floor(df.shape[0] / nworkers)
    chunks = [(chunksize * i, (chunksize * i) + chunksize) for i in range(nworkers)]
    chunks.append((chunksize * nworkers, df.shape[0]))
    pool = ThreadPool(nworkers)

    def worker(chunk):
        i, j = chunk
        df.iloc[i:j, :].to_sql(*args, **kwargs)

    pool.map(worker, chunks)
    pool.close()
    pool.join()


....

insert_df(df, "foo_bar", engine, if_exists='append')

Here is the simple way to do it.

Download the driver for the SQL database connection

For Linux and Mac OS:

https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017

For Windows:

https://www.microsoft.com/en-us/download/details.aspx?id=56567

Create the connection

from sqlalchemy import create_engine
import urllib
import pandas as pd
server = '*****'
database = '********'
username = '**********'
password = '*********'

params = urllib.parse.quote_plus(
'DRIVER={ODBC Driver 17 for SQL Server};'+ 
'SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password) 

engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params) 

#Checking Connection 
connected = pd.io.sql._is_sqlalchemy_connectable(engine)

print(connected)   #Output is True if connection established successfully

Insert the data

df.to_sql('Table_Name', con=engine, if_exists='append', index=False)


"""
if_exists: {'fail', 'replace', 'append'}, default 'fail'
     fail: If table exists, do nothing.
     replace: If table exists, drop it, recreate it, and insert data.
     append: If table exists, insert data. Create if does not exist.
"""

If there are many records:

# limit based on sp_prepexec parameter count
tsql_chunksize = 2097 // len(bd_pred_score_100.columns)
# cap at 1000 (limit for number of rows inserted by table-value constructor)
tsql_chunksize = 1000 if tsql_chunksize > 1000 else tsql_chunksize
print(tsql_chunksize)


df.to_sql('table_name', con = engine, if_exists = 'append', index= False, chunksize=tsql_chunksize)

PS: You can change the parameters as per your requirements.

My Postgres-specific solution below auto-creates the database table from your pandas dataframe and performs a fast bulk insert using the Postgres COPY my_table FROM ...

import io

import pandas as pd
from sqlalchemy import create_engine

def write_to_table(df, db_engine, schema, table_name, if_exists='fail'):
    string_data_io = io.StringIO()
    df.to_csv(string_data_io, sep='|', index=False)
    pd_sql_engine = pd.io.sql.pandasSQL_builder(db_engine, schema=schema)
    table = pd.io.sql.SQLTable(table_name, pd_sql_engine, frame=df,
                               index=False, if_exists=if_exists, schema=schema)
    table.create()
    string_data_io.seek(0)
    with db_engine.connect() as connection:
        with connection.connection.cursor() as cursor:
            # the HEADER option below makes COPY skip the header line itself,
            # so the stream is simply rewound to the start here
            copy_cmd = "COPY %s.%s FROM STDIN HEADER DELIMITER '|' CSV" % (schema, table_name)
            cursor.copy_expert(copy_cmd, string_data_io)
        connection.connection.commit()

For those like me who are trying to implement the solutions mentioned above:

Pandas 0.24.0 now has to_sql with chunksize and method='multi' options, which can insert in bulk...
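
For reference, a minimal sketch of what that call looks like; the table name and batch size are placeholders, and engine and df are assumed to already exist:

# Hedged sketch of pandas' built-in bulk insert (pandas >= 0.24):
# chunksize batches the rows, method='multi' emits multi-row INSERT statements.
df.to_sql(
    'my_table',          # hypothetical target table name
    con=engine,          # existing SQLAlchemy engine
    if_exists='append',
    index=False,
    chunksize=1000,      # hypothetical batch size
    method='multi',
)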

This worked for me to connect to an Oracle database using cx_Oracle and SQLAlchemy.

import time  # used for the timing below

import sqlalchemy
import cx_Oracle
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, String
from sqlalchemy.orm import sessionmaker
import pandas as pd

# credentials
username = "username"
password = "password"
connectStr = "connection:/string"
tableName = "tablename"

t0 = time.time()

# connection
dsn = cx_Oracle.makedsn('host','port',service_name='servicename')

Base = declarative_base()

class LANDMANMINERAL(Base):
    __tablename__ = 'tablename'

    DOCUMENTNUM = Column(String(500), primary_key=True)
    DOCUMENTTYPE = Column(String(500))
    FILENUM = Column(String(500))
    LEASEPAYOR = Column(String(500))
    LEASESTATUS = Column(String(500))
    PROSPECT = Column(String(500))
    SPLIT = Column(String(500))
    SPLITSTATUS = Column(String(500))

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (username, password, dsn))
conn = engine.connect()  

Base.metadata.bind = engine

# Creating the session

DBSession = sessionmaker(bind=engine)

session = DBSession()

# Bulk insertion
data = pd.read_csv('data.csv')
lists = data.to_dict(orient='records')


# reflect the existing table from the bound metadata
table = sqlalchemy.Table('landmanmineral', Base.metadata, autoload=True)
conn.execute(table.insert(), lists)

session.commit()

session.close() 

print("time taken %8.8f seconds" % (time.time() - t0) )

The code below may help you. I faced the same issue while loading 695,000K records.

The approach truncates the table before loading:

with engine.begin() as conn:
     conn.execute(sa.text("TRUNCATE TABLE <schema.table>"))

Note: engine = my connection to the destination server, and sa stands for import sqlalchemy as sa.

table_name = "<destination_table>"
df.to_sql(table_name, engine, schema = 'schema', if_exists = 'replace', index=False)

Do append/replace depending on the requirement.

For anyone facing this problem with Redshift as the target database, note that Redshift does not implement the full set of Postgres commands, so some of the answers that use Postgres' COPY FROM or copy_from() will not work (psycopg2.ProgrammingError: syntax error at or near "stdin" when trying to copy_from to Redshift).

The solution for speeding up inserts into Redshift is to use a file ingest or Odo; a sketch of the file-ingest path is shown after the references below.

References:
Odo: http://odo.pydata.org/en/latest/perf.html
Odo with Redshift: https://github.com/blaze/odo/blob/master/docs/source/aws.rst
Redshift COPY (from an S3 file): https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
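
As a rough illustration of the file-ingest path (not taken from any answer above), one can stage the data as a CSV in S3 and then run Redshift's COPY against it; the bucket, IAM role, and table name below are placeholders:

# Hypothetical sketch: bulk-load a staged CSV from S3 into Redshift with COPY.
# Assumes the DataFrame has already been written to s3_path (e.g. via boto3 or s3fs)
# and that engine is a SQLAlchemy engine pointed at the Redshift cluster.
from sqlalchemy import text

def redshift_copy_from_s3(engine, table, s3_path, iam_role):
    copy_sql = text(f"""
        COPY {table}
        FROM '{s3_path}'
        IAM_ROLE '{iam_role}'
        FORMAT AS CSV
        IGNOREHEADER 1
    """)
    with engine.begin() as conn:
        conn.execute(copy_sql)

# Example call with placeholder values:
# redshift_copy_from_s3(engine, 'my_table', 's3://my-bucket/data.csv',
#                       'arn:aws:iam::123456789012:role/myRedshiftRole')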
