
sqlalchemy and huge table in postgresql

I have a table in a postgresql database with ~900,000 rows. I want to copy it row by row to another table that has some extra columns, transforming each row and filling in the new columns along the way. The problem is that RAM fills up.

Here is the relevant part of the code:

engine = sqlalchemy.create_engine(URL(**REMOTE), echo=False)
Session = sessionmaker(bind=engine)
session = Session()
n=1000
counter=1
for i in range(1,total+1,n):
    ids=",".join(str(j) for j in range(i,i+n))
    q="SELECT * from table_parts where id in (%s)"%ids
    r=session.execute(q).fetchall()
    for element in r:
        data={}
        ....
       [taking data from each row, extracting string,calculation,
        and filling extra columns that the new table has]
       ...
    query=query.bindparams(**data)
    try:
        session.execute(query)
    except:
        session.rollback()
        raise 
    if counter%n==0:
        print "COMMITTING....",counter,datetime.datetime.now().strftime("%H:%M:%S")
        session.commit()
    counter+=1

The queries are correct, so there are no errors there. Until I press Ctrl+C, the new table gets updated correctly.

The problem seems to be the query "SELECT * from table_parts where id in (1,2,3,4...1000)". I already tried it with a postgresql array.

Things I have already tried:

  • Streaming the results (snippet taken from a linked source):

      results = (connection
                 .execution_options(stream_results=True)  # Added this line
                 .execute(query))

    As far as I know, this uses a server-side cursor when used with postgresql. I ditched the session object I have in my posted code and used engine.connect().

  • Creating a new connection object on each iteration. Surprisingly, this does not work either: RAM still fills up.

From the documentation:

Note that the stream_results execution option is enabled automatically if the yield_per() method is used.

So yield_per from the Query API is equivalent to the stream_results option mentioned above.

Thanks
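For what it's worth, the idea behind stream_results can be sketched at the DB-API level: pull rows from the cursor in fixed-size batches instead of calling fetchall(). The sketch below uses the stdlib sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; the helper name iter_batches and the sample data are made up for illustration. With psycopg2, this only bounds client memory when the cursor is a server-side (named) cursor, which is what stream_results asks for.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted.

    iter_batches is a hypothetical helper, not part of any library.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Assumption: an in-memory sqlite3 database stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO table_parts (data) VALUES (?)",
    [("row %d" % i,) for i in range(2500)],
)

cursor = conn.execute("SELECT id, data FROM table_parts ORDER BY id")
# Only one batch of rows is held in Python at a time.
batch_sizes = [len(batch) for batch in iter_batches(cursor, 1000)]
print(batch_sizes)
```
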

create table table_parts ( id serial primary key, data text );
-- Insert 1M rows of about 32kB data =~ 32GB of data
-- Needs only 0.4GB of disk space because of builtin compression
-- Might take a few minutes
insert into table_parts(data)
  select rpad('',32*1024,'A') from generate_series(1,1000000);

The code below, which uses SQLAlchemy Core, does not use a lot of memory:

import sqlalchemy
import datetime
import getpass

metadata = sqlalchemy.MetaData()
table_parts = sqlalchemy.Table('table_parts', metadata,
    sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column('data', sqlalchemy.String)
)

engine = sqlalchemy.create_engine(
    'postgresql:///'+getpass.getuser(),
    echo=False
)
connection = engine.connect()

n = 1000

select_table_parts_n = sqlalchemy.sql.select([table_parts]).\
    where(table_parts.c.id>sqlalchemy.bindparam('last_id')).\
    order_by(table_parts.c.id).\
    limit(n)

update_table_parts = table_parts.update().\
    where(table_parts.c.id == sqlalchemy.bindparam('table_part_id')).\
    values(data=sqlalchemy.bindparam('table_part_data'))

last_id=0
while True:
    with connection.begin() as transaction:
        row = None
        for row in connection.execute(select_table_parts_n, last_id=last_id):
            data = row.data.replace('A','B')
            connection.execute(
                update_table_parts,
                table_part_id=row.id,
                table_part_data=data
            )
        if row is None:
            break
        else:
            print("COMMITTING {} {:%H:%M:%S}".format(
                row.id, datetime.datetime.now()))
            transaction.commit()
            last_id=row.id

You don't seem to use ORM features, so I suppose you should also use SQLAlchemy Core.
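The loop above pages through the table by primary key (id > last_id, ORDER BY id, LIMIT n) rather than building a long IN (...) list, and commits once per page. Below is a minimal, self-contained sketch of the same pattern applied to the original goal of copying into a second table with an extra column. It rests on several assumptions: stdlib sqlite3 stands in for PostgreSQL, and the names table_parts_new, data_length, and copy_in_pages are hypothetical and not taken from the question.

```python
import sqlite3


def copy_in_pages(conn, page_size):
    """Copy table_parts into table_parts_new one page at a time, keyed on id.

    Returns the number of pages that were copied and committed.
    """
    last_id = 0
    pages = 0
    while True:
        # Keyset pagination: resume after the last id seen so far.
        rows = conn.execute(
            "SELECT id, data FROM table_parts "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        # Transform each row and compute the extra column (illustrative only).
        new_rows = [
            (row_id, data.replace("A", "B"), len(data))
            for row_id, data in rows
        ]
        conn.executemany(
            "INSERT INTO table_parts_new (id, data, data_length) "
            "VALUES (?, ?, ?)",
            new_rows,
        )
        conn.commit()  # one commit per page keeps each transaction small
        last_id = rows[-1][0]
        pages += 1
    return pages


# Assumption: small in-memory tables stand in for the real ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "CREATE TABLE table_parts_new "
    "(id INTEGER PRIMARY KEY, data TEXT, data_length INTEGER)"
)
conn.executemany(
    "INSERT INTO table_parts (data) VALUES (?)",
    [("A" * (i % 7 + 1),) for i in range(2500)],
)
conn.commit()

pages = copy_in_pages(conn, 1000)
copied = conn.execute("SELECT COUNT(*) FROM table_parts_new").fetchone()[0]
print(pages, copied)
```

At most page_size rows are held in Python at any moment, and because the WHERE clause only compares against the last id seen, each page is served by a range scan on the primary key index.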
