
Transfer millions of records from SQLite to PostgreSQL using Python SQLAlchemy

We have around 1,500 SQLite databases; each has 0 to 20,000,000 records in its violation table, and the total number of violation records is around 90,000,000.

Each file is generated by running a crawler on one of the 1,500 servers. Alongside the violation table there are some other tables that we use for further analysis.

To analyze the results, we push all these SQLite violation records into a PostgreSQL violation table, along with some other insertions and calculations.

Following is the code I use to transfer the records:

from collections import defaultdict


class PolicyViolationService(object):

    def __init__(self, pg_dao, crawler_dao_s):
        self.pg_dao = pg_dao
        self.crawler_dao_s = crawler_dao_s
        self.user_violation_count = defaultdict(int)
        self.analyzer_time_id = self.pg_dao.get_latest_analyzer_tracker()

    def process(self):
        """
        Transfer policy violation records from the crawler DBs to the analyzer DB.
        """
        for crawler_dao in self.crawler_dao_s:
            violations = self.get_violations(crawler_dao.get_violations())
            self.pg_dao.insert_rows(violations)

    def get_violations(self, violation_records):
        for violation in violation_records:
            violation = dict(violation.items())
            violation.pop('id')
            self.user_violation_count[violation.get('user_id')] += 1
            violation['analyzer_time_id'] = self.analyzer_time_id
            yield PolicyViolation(**violation)

in sqlite dao
==============
def get_violations(self):
    result_set = self.db.execute('select * from policyviolations;')
    return result_set

in pg dao
=========
def insert_rows(self, rows):
    self.session.add_all(rows)
    self.session.commit()

This code works but takes a very long time. What is the right way to approach this problem? We have been discussing options such as parallel processing, skipping SQLAlchemy, and a few others. Please suggest the right way.

Thanks in advance!

The fastest way to get these rows into PostgreSQL is to use the COPY command, outside of SQLAlchemy entirely.
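For example, here is a minimal sketch using psycopg2's copy_expert. The target table name and the column list are placeholders, not taken from the question, so adjust them to your actual schema; for a 20,000,000-row source file you would also want to copy in chunks rather than buffering the whole table in memory.

# A minimal sketch, assuming psycopg2 and a target table "policyviolation";
# the table and column names here are illustrative placeholders.
import csv
import io
import sqlite3

COLUMNS = ['user_id', 'analyzer_time_id', 'rule', 'detail']  # adjust to your schema

def copy_violations(sqlite_path, pg_conn, analyzer_time_id):
    src = sqlite3.connect(sqlite_path)
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Stream rows out of SQLite and write them as CSV in the COPY column order.
    for user_id, rule, detail in src.execute(
            'select user_id, rule, detail from policyviolations'):
        writer.writerow([user_id, analyzer_time_id, rule, detail])
    buf.seek(0)
    with pg_conn.cursor() as cur:
        cur.copy_expert(
            'COPY policyviolation ({}) FROM STDIN WITH CSV'.format(', '.join(COLUMNS)),
            buf,
        )
    pg_conn.commit()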

Within SQLAlchemy, note that the ORM is very slow. It is doubly slow if you accumulate lots of objects in the ORM before flushing. You could make it faster by flushing every 1,000 items or so; that would also keep the session from growing too big. However, why not just use SQLAlchemy Core to generate the inserts:

ins = violations.insert().values(col1='value', col2='value')
conn.execute(ins)
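As a rough sketch of batching this, assuming "violations" is a sqlalchemy.Table reflected from the PostgreSQL schema and "engine" is the PostgreSQL engine (both names are illustrative), you can pass a list of dicts so SQLAlchemy uses executemany under the hood; the batch size of 1,000 is just a starting point:

# A minimal sketch, assuming "violations" is a sqlalchemy.Table for the target
# table and "engine" is the PostgreSQL engine.
from itertools import islice

def insert_in_batches(engine, violations, rows, batch_size=1000):
    """rows is an iterable of dicts whose keys match the violations columns."""
    rows = iter(rows)
    with engine.begin() as conn:
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            # Passing a list of dicts makes SQLAlchemy execute a multi-row insert.
            conn.execute(violations.insert(), batch)

With this approach, get_violations in the question would yield the plain dict instead of constructing a PolicyViolation ORM object.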
