
Bulk Insert A Pandas DataFrame Using SQLAlchemy

I have some rather large pandas DataFrames and I'd like to use the new bulk SQL mappings to upload them to a Microsoft SQL Server via SQL Alchemy. The pandas.to_sql method, while nice, is slow.

I'm having trouble writing the code...

I'd like to be able to pass this function a pandas DataFrame which I'm calling table, a schema name I'm calling schema, and a table name I'm calling name. Ideally, the function will 1.) delete the table if it already exists, 2.) create a new table, 3.) create a mapper, and 4.) bulk insert using the mapper and pandas data. I'm stuck on part 3.

Here's my (admittedly rough) code. I'm struggling with how to get the mapper function to work with my primary keys. I don't really need primary keys but the mapper function requires it.

Thanks for the insights.

from sqlalchemy import create_engine, Table, Column, MetaData
from sqlalchemy.orm import mapper, create_session
from sqlalchemy.ext.declarative import declarative_base
from pandas.io.sql import SQLTable, SQLDatabase

def bulk_upload(table, schema, name):
    e = create_engine('mssql+pyodbc://MYDB')
    s = create_session(bind=e)
    m = MetaData(bind=e,reflect=True,schema=schema)
    Base = declarative_base(bind=e,metadata=m)
    t = Table(name,m)
    m.remove(t)
    t.drop(checkfirst=True)
    sqld = SQLDatabase(e, schema=schema,meta=m)
    sqlt = SQLTable(name, sqld, table).table
    sqlt.metadata = m
    m.create_all(bind=e,tables=[sqlt])    
    class MyClass(Base):
        pass
    mapper(MyClass, sqlt)    

    s.bulk_insert_mappings(MyClass, table.to_dict(orient='records'))
    return

I ran into a similar issue with pd.to_sql taking hours to upload data. The below code bulk inserted the same data in a few seconds.

from sqlalchemy import create_engine
import psycopg2 as pg
#cStringIO provides an in-memory text buffer (Python 2) for streaming the dataframe to sql
import cStringIO

address = 'postgresql://<username>:<pswd>@<host>:<port>/<database>'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()

#df is the dataframe containing an index and the columns "Event" and "Day"
#create Index column to use as primary key
df.reset_index(inplace=True)
df.rename(columns={'index':'Index'}, inplace =True)

#create the table but first drop if it already exists
command = '''DROP TABLE IF EXISTS localytics_app2;
CREATE TABLE localytics_app2
(
"Index" serial primary key,
"Event" text,
"Day" timestamp without time zone
);'''
cursor.execute(command)
connection.commit()

#stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
output = cStringIO.StringIO()
#ignore the index
df.to_csv(output, sep='\t', header=False, index=False)
#jump to start of stream
output.seek(0)
contents = output.getvalue()
cur = connection.cursor()
#null values become ''
cur.copy_from(output, 'localytics_app2', null="")    
connection.commit()
cur.close()

This might have been answered by then, but I found the solution by collating different answers on this site and aligning with SQLAlchemy's doc.

  1. The table needs to already exist in db1, with an index set up with auto_increment on (a minimal creation sketch is shown just before the code below).
  2. The class Current needs to align with the dataframe imported from the CSV and the table in db1.

Hope this helps whoever comes here and wants to mix Pandas and SQLAlchemy in a quick way.
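If the table does not exist yet, one way to create it is to let SQLAlchemy generate it from the mapped class. This is a minimal sketch, not part of the original answer, and it assumes the engine and the declarative Current class defined in the code below:

# a minimal sketch, assuming the engine and declarative Current class from the
# code below; on MySQL the Integer primary key becomes an auto_increment column,
# and create_all() only creates the table if it does not already exist
Base.metadata.create_all(engine)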

from urllib import quote_plus as urlquote
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric
from sqlalchemy.orm import sessionmaker
import pandas as pd


# Set up of the engine to connect to the database
# the urlquote is used for passing the password which might contain special characters such as "/"
engine = create_engine('mysql://root:%s@localhost/db1' % urlquote('weirdPassword*withsp€cialcharacters'), echo=False)
conn = engine.connect()
Base = declarative_base()

#Declaration of the class in order to write into the database. This structure is standard and should align with SQLAlchemy's doc.
class Current(Base):
    __tablename__ = 'tableName'

    id = Column(Integer, primary_key=True)
    Date = Column(String(500))
    Type = Column(String(500))
    Value = Column(Numeric())

    def __repr__(self):
        return "(id='%s', Date='%s', Type='%s', Value='%s')" % (self.id, self.Date, self.Type, self.Value)

# Set up of the table in db and the file to import
fileToRead = 'file.csv'
tableToWriteTo = 'tableName'

# Panda to create a lovely dataframe
df_to_be_written = pd.read_csv(fileToRead)
# orient='records' is the key here: it produces the list-of-dicts format described in the SQLAlchemy doc for bulk inserts.
listToWrite = df_to_be_written.to_dict(orient='records')

metadata = sqlalchemy.schema.MetaData(bind=engine,reflect=True)
table = sqlalchemy.Table(tableToWriteTo, metadata, autoload=True)

# Open the session
Session = sessionmaker(bind=engine)
session = Session()

# Insert the dataframe into the database in one bulk operation
conn.execute(table.insert(), listToWrite)

# Commit the changes
session.commit()

# Close the session
session.close()

Based on @ansonw's answer:

import cStringIO

def to_sql(engine, df, table, if_exists='fail', sep='\t', encoding='utf8'):
    # Create Table
    df[:0].to_sql(table, engine, if_exists=if_exists)

    # Prepare data
    output = cStringIO.StringIO()
    df.to_csv(output, sep=sep, header=False, encoding=encoding)
    output.seek(0)

    # Insert data
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.copy_from(output, table, sep=sep, null='')
    connection.commit()
    cursor.close()

I insert 200,000 lines in 5 seconds instead of 4 minutes.

Pandas 0.25.1 has a parameter to do multi-inserts, so it's no longer necessary to work around this issue with SQLAlchemy.

Set method='multi' when calling pandas.DataFrame.to_sql.

In this example, it would be df.to_sql(table, schema=schema, con=e, index=False, if_exists='replace', method='multi').

Answer sourced from docs here.

Worth noting that I've only tested this with Redshift. Please let me know how it goes on other databases so I can update this answer.

As this is an I/O heavy workload, you can also use the python threading module through multiprocessing.dummy. This sped things up for me:

import math
from multiprocessing.dummy import Pool as ThreadPool

...

def insert_df(df, *args, **kwargs):
    nworkers = 4

    chunksize = math.floor(df.shape[0] / nworkers)
    chunks = [(chunksize * i, (chunksize * i) + chunksize) for i in range(nworkers)]
    chunks.append((chunksize * nworkers, df.shape[0]))
    pool = ThreadPool(nworkers)

    def worker(chunk):
        i, j = chunk
        df.iloc[i:j, :].to_sql(*args, **kwargs)

    pool.map(worker, chunks)
    pool.close()
    pool.join()


....

insert_df(df, "foo_bar", engine, if_exists='append')

Here is a simple method:

Download Drivers for SQL database connectivity

For Linux and Mac OS:

https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017

For Windows:

https://www.microsoft.com/en-us/download/details.aspx?id=56567

Creating Connection

from sqlalchemy import create_engine 
import urllib
import pandas as pd
server = '*****'
database = '********'
username = '**********'
password = '*********'

params = urllib.parse.quote_plus(
'DRIVER={ODBC Driver 17 for SQL Server};'+ 
'SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password) 

engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params) 

#Checking Connection 
connected = pd.io.sql._is_sqlalchemy_connectable(engine)

print(connected)   #Output is True if connection established successfully
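Note that _is_sqlalchemy_connectable is a private pandas helper and may not be available in every pandas version; a plainer check (a sketch, assuming the engine built above) is simply to open and close a connection:

# a sketch: opening a connection raises immediately if the server is unreachable
with engine.connect() as conn:
    print("connection established")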

Data insertion

df.to_sql('Table_Name', con=engine, if_exists='append', index=False)


"""
if_exists: {'fail', 'replace', 'append'}, default 'fail'
     fail: If table exists, do nothing.
     replace: If table exists, drop it, recreate it, and insert data.
     append: If table exists, insert data. Create if does not exist.
"""

If there are many records

# limit based on sp_prepexec parameter count
tsql_chunksize = 2097 // len(df.columns)
# cap at 1000 (limit for number of rows inserted by table-value constructor)
tsql_chunksize = 1000 if tsql_chunksize > 1000 else tsql_chunksize
print(tsql_chunksize)


df.to_sql('table_name', con = engine, if_exists = 'append', index= False, chunksize=tsql_chunksize)

PS: You can change the parameters as per your requirement.

My postgres specific solution below auto-creates the database table using your pandas dataframe, and performs a fast bulk insert using the postgres COPY my_table FROM ...

import io

import pandas as pd
from sqlalchemy import create_engine

def write_to_table(df, db_engine, schema, table_name, if_exists='fail'):
    string_data_io = io.StringIO()
    df.to_csv(string_data_io, sep='|', index=False)
    pd_sql_engine = pd.io.sql.pandasSQL_builder(db_engine, schema=schema)
    table = pd.io.sql.SQLTable(table_name, pd_sql_engine, frame=df,
                               index=False, if_exists=if_exists, schema=schema)
    table.create()
    string_data_io.seek(0)
    # the COPY command below declares HEADER, so leave the header line in the stream
    # for Postgres to skip; consuming it here as well would drop the first data row
    with db_engine.connect() as connection:
        with connection.connection.cursor() as cursor:
            copy_cmd = "COPY %s.%s FROM STDIN HEADER DELIMITER '|' CSV" % (schema, table_name)
            cursor.copy_expert(copy_cmd, string_data_io)
        connection.connection.commit()
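A hypothetical call to the function above might look like this (the connection URL, schema, and table name are placeholders):

# hypothetical usage of write_to_table(); the URL, schema and table name are placeholders
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df = pd.DataFrame({'event': ['a', 'b'], 'value': [1, 2]})
write_to_table(df, engine, 'public', 'my_table', if_exists='replace')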

For people like me who are trying to implement the aforementioned solutions:

Pandas 0.24.0's to_sql now has chunksize and method='multi' options that insert in bulk...
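A minimal sketch of that usage (the engine URL, table name, and sample data are placeholders):

from sqlalchemy import create_engine
import pandas as pd

# a sketch of pandas >= 0.24 bulk insert via to_sql; URL and table name are placeholders
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df = pd.DataFrame({'event': ['a', 'b', 'c'], 'value': [1, 2, 3]})
df.to_sql('my_table', engine, if_exists='append', index=False,
          method='multi', chunksize=1000)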

This worked for me to connect to Oracle Database using cx_Oracle and SQLAlchemy:

import sqlalchemy
import cx_Oracle
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, String
from sqlalchemy.orm import sessionmaker
import pandas as pd
import time

# credentials
username = "username"
password = "password"
connectStr = "connection:/string"
tableName = "tablename"

t0 = time.time()

# connection
dsn = cx_Oracle.makedsn('host','port',service_name='servicename')

Base = declarative_base()

class LANDMANMINERAL(Base):
    __tablename__ = 'tablename'

    DOCUMENTNUM = Column(String(500), primary_key=True)
    DOCUMENTTYPE = Column(String(500))
    FILENUM = Column(String(500))
    LEASEPAYOR = Column(String(500))
    LEASESTATUS = Column(String(500))
    PROSPECT = Column(String(500))
    SPLIT = Column(String(500))
    SPLITSTATUS = Column(String(500))

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (username, password, dsn))
conn = engine.connect()  

Base.metadata.bind = engine

# Creating the session

DBSession = sessionmaker(bind=engine)

session = DBSession()

# Bulk insertion
data = pd.read_csv('data.csv')
lists = data.to_dict(orient='records')


table = sqlalchemy.Table('landmanmineral', Base.metadata, autoload=True, autoload_with=engine)
conn.execute(table.insert(), lists)

session.commit()

session.close() 

print("time taken %8.8f seconds" % (time.time() - t0) )

The code below might help you; I was facing the same issue while loading 695,000K records.

Method: truncate the table before loading.

with engine.begin() as conn:
     conn.execute(sa.text("TRUNCATE TABLE <schema.table>"))

Note: engine is my connection to the destination server, and sa stands for sqlalchemy (import sqlalchemy as sa).

table_name = "<destination_table>"
df.to_sql(table_name, engine, schema = 'schema', if_exists = 'replace', index=False)

Use append or replace depending on your requirement.

For anyone facing this problem and having the destination DB as Redshift, note that Redshift does not implement the full set of Postgres commands, so some of the answers using either Postgres' COPY FROM or copy_from() will not work (see: psycopg2.ProgrammingError: syntax error at or near "stdin" error when trying to copy_from redshift).

Solution for speeding up the INSERTs to Redshift is to use a file ingest or Odo.
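A rough sketch of the file-ingest route (not from the original answer; the bucket, key, IAM role, table name, connection URL, and sample data are all placeholders):

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine, text

# a rough sketch: stage the DataFrame as CSV on S3, then let Redshift ingest it
# with COPY; bucket, key, IAM role, table name and URL below are placeholders
df = pd.DataFrame({'event': ['a', 'b'], 'day': ['2020-01-01', '2020-01-02']})

csv_buf = io.StringIO()
df.to_csv(csv_buf, index=False, header=False)
boto3.client('s3').put_object(Bucket='my-bucket', Key='staging/df.csv',
                              Body=csv_buf.getvalue().encode('utf-8'))

engine = create_engine('postgresql://user:password@my-cluster.example.com:5439/mydb')
with engine.begin() as conn:
    conn.execute(text(
        "COPY my_table FROM 's3://my-bucket/staging/df.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' CSV"
    ))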

Reference:
About Odo: http://odo.pydata.org/en/latest/perf.html
Odo with Redshift: https://github.com/blaze/odo/blob/master/docs/source/aws.rst
Redshift COPY (from S3 file): https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html


 