Transferring millions of records from SQLite to PostgreSQL using Python SQLAlchemy

Question

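We have around 1500 SQLite databases. Each has 0 to 20,000,000 records in its violation table, and the total number of violation records is around 90,000,000.

Each file was generated by running a crawler on one of the 1500 servers. Alongside the violation table we have some other tables that we use for further analysis.

To analyze the results, we push all of these SQLite violation records into a PostgreSQL violation table, along with some other inserts and calculations.

Following is the code I use to transfer the records: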

from collections import defaultdict

# PolicyViolation is presumably the ORM model mapped to the PostgreSQL
# violation table; it is defined elsewhere.


class PolicyViolationService(object):

    def __init__(self, pg_dao, crawler_dao_s):
        self.pg_dao = pg_dao
        self.crawler_dao_s = crawler_dao_s
        self.user_violation_count = defaultdict(int)
        self.analyzer_time_id = self.pg_dao.get_latest_analyzer_tracker()

    def process(self):
        """
        Transfer policy violation records from crawler db to analyzer db.
        """
        for crawler_dao in self.crawler_dao_s:
            violations = self.get_violations(crawler_dao.get_violations())
            self.pg_dao.insert_rows(violations)

    def get_violations(self, violation_records):
        for violation in violation_records:
            violation = dict(violation.items())
            # Drop the SQLite primary key so PostgreSQL assigns its own.
            violation.pop('id')
            self.user_violation_count[violation.get('user_id')] += 1
            violation['analyzer_time_id'] = self.analyzer_time_id
            yield PolicyViolation(**violation)

in sqlite dao
==============
def get_violations(self):
    result_set = self.db.execute('select * from policyviolations;')
    return result_set

in pg dao
=========
def insert_rows(self, rows):
    self.session.add_all(rows)
    self.session.commit()

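This code works, but it takes a very long time. What is the right way to tackle this problem? We have been discussing parallel processing, skipping SQLAlchemy, and some other options. Please suggest the right approach.

Thanks in advance!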

Answer 1

The fastest way to get these into PostgreSQL is to use the COPY command, outside of any SQLAlchemy.
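
A rough sketch of the COPY route, assuming psycopg2 is the PostgreSQL driver and that the target table is also called policyviolations with the same column names as the SQLite table (neither is stated in the question). The helper names sqlite_rows_to_csv and copy_into_postgres are made up for illustration.

```python
import csv
import io
import sqlite3


def sqlite_rows_to_csv(sqlite_path, analyzer_time_id, table="policyviolations"):
    """Dump one crawler SQLite table into an in-memory CSV buffer.

    Drops the local 'id' column and appends analyzer_time_id, mirroring
    what get_violations() does in the question.
    """
    conn = sqlite3.connect(sqlite_path)
    try:
        cur = conn.execute("SELECT * FROM {}".format(table))
        names = [d[0] for d in cur.description]
        keep = [i for i, name in enumerate(names) if name != "id"]
        columns = [names[i] for i in keep] + ["analyzer_time_id"]

        buf = io.StringIO()
        writer = csv.writer(buf)
        for row in cur:
            # None is written as an empty field, which COPY in CSV mode
            # reads back as NULL (empty strings become NULL as well).
            writer.writerow([row[i] for i in keep] + [analyzer_time_id])
        buf.seek(0)
        return columns, buf
    finally:
        conn.close()


def copy_into_postgres(pg_cursor, columns, buf, table="policyviolations"):
    """Bulk-load the CSV buffer with COPY ... FROM STDIN.

    pg_cursor is assumed to be a psycopg2 cursor; copy_expert() is
    psycopg2's method for running an arbitrary COPY statement.
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    pg_cursor.copy_expert(sql, buf)
```

With a real connection this would be called roughly as `columns, buf = sqlite_rows_to_csv(path, analyzer_time_id)`, then `copy_into_postgres(pg_conn.cursor(), columns, buf)` and `pg_conn.commit()`. For the 20,000,000-row files, writing to a temporary file or running COPY in chunks would probably be kinder to memory than a single StringIO buffer.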

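Within SQLAlchemy, keep in mind that the ORM is very slow. It is even slower if you pile up a lot of objects in the session and then flush them all at once. You can speed things up by flushing after every 1000 items or so; that also keeps the session from growing too large. However, why not just use SQLAlchemy Core to generate the inserts: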

 ins = violations.insert().values(col1='value', col2='value')
 conn.execute(ins)
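
To combine the two suggestions above (Core inserts, and batches of about 1000), something along these lines might work. It is only a sketch: the table definition is a trimmed-down guess (only user_id and analyzer_time_id are known from the question), and chunked and insert_in_batches are hypothetical helper names. Passing a list of dicts to conn.execute() makes SQLAlchemy issue one executemany call per batch.

```python
from itertools import islice

from sqlalchemy import Column, Integer, MetaData, Table

metadata = MetaData()
# Hypothetical, trimmed-down schema: only user_id and analyzer_time_id
# are known from the question; the real table has more columns.
violations = Table(
    "policyviolations",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("user_id", Integer),
    Column("analyzer_time_id", Integer),
)


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(engine, rows, batch_size=1000):
    """Insert dict rows with Core, one executemany call per batch."""
    total = 0
    with engine.begin() as conn:  # one transaction, committed on exit
        for chunk in chunked(rows, batch_size):
            conn.execute(violations.insert(), chunk)
            total += len(chunk)
    return total
```

Here rows would be the plain dicts built in get_violations(), without wrapping them in PolicyViolation objects, so no ORM session is involved at all.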

Solution 1
0 2015-02-26 08:41:24
