Python Pandas - Using to_sql to write large data frames in chunks

I'm using Pandas' to_sql function to write to MySQL, which is timing out due to large frame size (1M rows, 20 columns).

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Is there a more official way to chunk through the data and write rows in blocks? I've written my own code, which seems to work. I'd prefer an official solution though. Thanks!

import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    """Write `frame` to `table_name` in blocks of `chunk_size` rows."""
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)

    # Replace NaN with None so MySQL receives proper NULLs
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print("Writing rows %s through %s" % (start_index, end_index))
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        # After the first block the table exists, so append the remaining blocks
        if_exists_param = 'append'

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...')  # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)

Update: this functionality has been merged into pandas master and will be released in 0.15 (probably end of September), thanks to @artemyk! See https://github.com/pydata/pandas/pull/8062

So starting from 0.15, you can specify the chunksize argument and, e.g., simply do:

df.to_sql('table', engine, chunksize=20000)
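
For context, a minimal sketch of that 0.15+ usage, assuming df is the DataFrame to write and using a placeholder MySQL connection string (user, password, host and schema are not from the original post):

import pandas as pd
import sqlalchemy

# Placeholder connection string -- adjust credentials, host and schema as needed
engine = sqlalchemy.create_engine('mysql://user:password@localhost/mydb')

# to_sql handles the batching itself when chunksize is given;
# if_exists='replace' recreates the table before the first batch
df.to_sql('table', engine, if_exists='replace', index=False, chunksize=20000)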

There is a beautiful idiomatic function, chunks, provided in an answer to this question.

In your case you can use this function like this:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l.iloc[i:i + n]

def write_to_db(engine, frame, table_name, chunk_size):
    for idx, chunk in enumerate(chunks(frame, chunk_size)):
        # Replace the table on the first chunk, then append the rest
        if idx == 0:
            if_exists_param = 'replace'
        else:
            if_exists_param = 'append'
        chunk.to_sql(con=engine, name=table_name, if_exists=if_exists_param)

The only drawback is that it doesn't support slicing the second (column) index inside the iloc call.
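
If only a subset of columns is needed, one workaround is to select the columns before chunking row-wise; a minimal sketch (the column names col_a and col_b are hypothetical):

# Hypothetical column names -- select the columns first, then chunk row-wise
subset = frame[['col_a', 'col_b']]
write_to_db(engine, subset, 'retail_pendingcustomers', 20000)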

Reading from one table and writing to another in chunks...

[myconn1 ---> Source Table], [myconn2 ---> Target Table], [ch = 10000]

# myconn1 / myconn2 are SQLAlchemy engines; source, target and ch (=10000) are defined above
for i, chunk in enumerate(pd.read_sql_table(table_name=source, con=myconn1, chunksize=ch)):
    # Replace the target table on the first chunk only, then append,
    # otherwise every chunk would overwrite the previous one
    chunk.to_sql(name=target, con=myconn2, index=False, chunksize=ch,
                 if_exists="replace" if i == 0 else "append")
    LOGGER.info(f"Done chunk {i + 1}")
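
For completeness, a minimal sketch of the setup that snippet assumes; the connection strings and table names below are placeholders, not values from the original answer:

import logging
import pandas as pd
import sqlalchemy

LOGGER = logging.getLogger(__name__)

# Placeholder connection strings for the source and target databases
myconn1 = sqlalchemy.create_engine('mysql://user:password@source-host/source_db')
myconn2 = sqlalchemy.create_engine('mysql://user:password@target-host/target_db')

source, target, ch = 'source_table', 'target_table', 10000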
