Huge tables in SQLAlchemy and PostgreSQL

Question

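I have a table in a PostgreSQL database with about 900,000 rows. I want to copy it row by row into another table that has some extra columns, transforming each row and filling in the new columns along the way. The problem is that RAM fills up.

Here is the relevant part of the code: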

import datetime

import sqlalchemy
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import sessionmaker

engine = sqlalchemy.create_engine(URL(**REMOTE), echo=False)
Session = sessionmaker(bind=engine)
session = Session()
n = 1000
counter = 1
for i in range(1, total + 1, n):
    # Comma-separated ids of this batch, e.g. "1,2,...,1000"
    ids = ",".join(str(j) for j in range(i, i + n))
    q = "SELECT * from table_parts where id in (%s)" % ids
    r = session.execute(q).fetchall()
    for element in r:
        data = {}
        # ....
        # [taking data from each row, extracting strings, doing calculations,
        #  and filling the extra columns that the new table has]
        # ...
        query = query.bindparams(**data)
        try:
            session.execute(query)
        except Exception:
            session.rollback()
            raise
        if counter % n == 0:
            print("COMMITTING....", counter,
                  datetime.datetime.now().strftime("%H:%M:%S"))
            session.commit()
        counter += 1

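The queries are correct, so there are no errors. Up to the point where I press Ctrl+C, the new table is updated correctly.

The problem seems to lie in the query "SELECT * from table_parts where id in (1,2,3,4...1000)". I have also tried using a PostgreSQL array instead.

Things I have already tried:

- results = (connection.execution_options(stream_results=True).execute(query)), where the execution_options(stream_results=True) call is the added part, as suggested elsewhere. As far as I know, this uses a server-side cursor when used with PostgreSQL. For this I dropped the session object from the posted code and used engine.connect().
- Creating a new connection object in each iteration. Surprisingly, this does not work either. RAM still fills up.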

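From the documentation:

Note that the stream_results execution option is enabled automatically if the yield_per() method is used.

So yield_per from the Query API is equivalent to the stream_results option mentioned above.

Thanks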
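
The general idea behind streaming, consuming rows in bounded batches instead of materializing the whole result with fetchall(), can be sketched at the DB-API level. This is only an illustration under stated assumptions: it uses an in-memory SQLite database from the standard library as a stand-in for PostgreSQL so that it runs without a server, and the helper name iter_in_batches is hypothetical. It does not reproduce a PostgreSQL server-side cursor.

```python
import sqlite3


def iter_in_batches(cursor, batch_size):
    """Yield rows from an executed cursor, fetching batch_size rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield row


# Stand-in data: an in-memory SQLite table instead of the PostgreSQL table_parts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO table_parts (data) VALUES (?)",
    [("row %d" % i,) for i in range(1, 2501)],
)
conn.commit()

cursor = conn.execute("SELECT id, data FROM table_parts ORDER BY id")
seen = 0
last_id = None
for row_id, data in iter_in_batches(cursor, batch_size=1000):
    # Only the current batch of rows is held in a Python list at any time.
    seen += 1
    last_id = row_id

print(seen, last_id)
```

Whether this actually bounds memory depends on the driver: some drivers buffer the full result on the client as soon as the query is executed, which is the case a server-side cursor is meant to avoid.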

Answer 1

create table table_parts ( id serial primary key, data text );
-- Insert 1M rows of about 32kB data =~ 32GB of data
-- Needs only 0.4GB of disk space because of builtin compression
-- Might take a few minutes
insert into table_parts(data)
  select rpad('',32*1024,'A') from generate_series(1,1000000);
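
The size estimate in the SQL comments above can be checked with a little arithmetic. This is a rough sketch that assumes one byte per character (true for the ASCII letter 'A' in UTF-8) and ignores per-row overhead:

```python
rows = 1000000
bytes_per_row = 32 * 1024  # rpad('', 32*1024, 'A') produces 32768 characters

total_bytes = rows * bytes_per_row
total_gb = total_bytes / 1e9               # decimal gigabytes
total_gib = total_bytes / float(2 ** 30)   # binary gibibytes

print(total_bytes, round(total_gb, 1), round(total_gib, 1))
```

So the uncompressed payload is roughly 32.8 GB (about 30.5 GiB), in line with the "=~ 32GB" in the comment.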

The following code, which uses SQLAlchemy Core, does not use a lot of memory:

import sqlalchemy
import datetime
import getpass

metadata = sqlalchemy.MetaData()
table_parts = sqlalchemy.Table('table_parts', metadata,
    sqlalchemy.Column('id', sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column('data', sqlalchemy.String)
)

engine = sqlalchemy.create_engine(
    'postgresql:///'+getpass.getuser(),
    echo=False
)
connection = engine.connect()

n = 1000

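# Fetch the next n rows whose id is greater than last_id, in id order.
# (select([...]) is the SQLAlchemy 1.x calling style.)
select_table_parts_n = sqlalchemy.sql.select([table_parts]).\
    where(table_parts.c.id>sqlalchemy.bindparam('last_id')).\
    order_by(table_parts.c.id).\
    limit(n)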

update_table_parts = table_parts.update().\
    where(table_parts.c.id == sqlalchemy.bindparam('table_part_id')).\
    values(data=sqlalchemy.bindparam('table_part_data'))

last_id=0
while True:
    with connection.begin() as transaction:
        row = None
        for row in connection.execute(select_table_parts_n, last_id=last_id):
            data = row.data.replace('A','B')
            connection.execute(
                update_table_parts,
                table_part_id=row.id,
                table_part_data=data
            )
        if row is None:
            # The batch was empty, so every row has been processed
            break
        else:
            print("COMMITTING {} {:%H:%M:%S}".format(
                row.id, datetime.datetime.now()))
            transaction.commit()
            last_id = row.id

You do not seem to be using any ORM features, so I suppose you may as well use SQLAlchemy Core.
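
The batching pattern used in the code above (select the next n rows with id greater than last_id, update them, commit, repeat) can be tried without a PostgreSQL server. The following is a minimal sketch under stated assumptions: it uses the standard library's sqlite3 module and an in-memory table as a stand-in, with 2,500 small rows instead of a million large ones, and the name BATCH_SIZE is illustrative.

```python
import sqlite3

BATCH_SIZE = 1000  # plays the role of n in the answer

# Stand-in for the PostgreSQL table: 2500 small rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_parts (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO table_parts (data) VALUES (?)",
    [("AAAA",) for _ in range(2500)],
)
conn.commit()

last_id = 0
batches = 0
while True:
    # Read only the next batch; memory use is bounded by BATCH_SIZE rows.
    rows = conn.execute(
        "SELECT id, data FROM table_parts WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH_SIZE),
    ).fetchall()
    if not rows:
        break
    # Transform each row of the batch and write it back.
    conn.executemany(
        "UPDATE table_parts SET data = ? WHERE id = ?",
        [(data.replace("A", "B"), row_id) for row_id, data in rows],
    )
    conn.commit()            # one transaction per batch
    last_id = rows[-1][0]    # resume after the highest id seen so far
    batches += 1

updated = conn.execute(
    "SELECT COUNT(*) FROM table_parts WHERE data = 'BBBB'"
).fetchone()[0]
print(batches, last_id, updated)
```

Because each query restarts from last_id rather than from an offset, each batch can use the primary key index to find its starting point, and no result set larger than one batch is ever held in memory.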

Solution 1
0 2017-03-13 23:27:15