
Inserting Millions of Rows in an SQLite Database, Python is Too Slow

I'm making a chess engine (a program that plays chess), and I have decided to use some chess statistics to choose the optimal moves. I don't have these statistics, so I decided to collect them myself from millions of games. I'm interested in the current move, the next move, and how many times the next move was played given the current move.

I thought about simply using a Python dictionary and storing it with pickle, but the file would be too large and hard to update with new games. So I decided to use an SQL database, more precisely SQLite.

I created a class MovesDatabase:

import os
import sqlite3

class MovesDatabase:

    def __init__(self, work_dir):
        self.con = sqlite3.connect(os.path.join(work_dir, "moves.db"))
        self.con.execute('PRAGMA temp_store = MEMORY')
        self.con.execute('PRAGMA synchronous = NORMAL')
        self.con.execute('PRAGMA journal_mode = WAL')
        self.cur = self.con.cursor()

        self.cur.execute("CREATE TABLE IF NOT EXISTS moves("
                         "move TEXT,"
                         "next TEXT,"
                         "count INTEGER DEFAULT 1);")
The table holds three pieces of information per row: move, next, count. move and next represent the state of a chess board as a string in FEN format.

Example:

  • rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
  • r1b1k1nr/p2p1pNp/n2B4/1p1NP2P/6P1/3P1Q2/P1P1K3/q5b1
  • 8/8/8/4p1K1/2k1P3/8/8/8 b
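For reference, these state strings keep only the first two fields of a full six-field FEN string (piece placement and side to move), which is what the code below does with `board.fen().split(' ')`. A minimal sketch of that extraction in plain Python (the FEN value here is just the standard starting position, used as an example):

```python
# A full FEN has six space-separated fields:
# placement, side to move, castling, en passant, halfmove clock, fullmove number.
# The state strings stored in the database keep only the first two.
full_fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

fields = full_fen.split(' ')
state = fields[0] + ' ' + fields[1]
print(state)  # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w
```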

The method below takes a games file, extracts the moves, and inserts the pair (move, next) if it is new, or updates its count if the pair already exists in the database:

def insert_moves_from_file(self, file: str):
    print("Extracting moves to database from " + file)

    count = 0

    with open(file) as games_file:
        game = chess.pgn.read_game(games_file)

        while game is not None:
            batch = []
            board = game.board()
            state_one = board.fen().split(' ')[0] + ' ' + board.fen().split(' ')[1]

            for move in game.mainline_moves():
                board.push(move)
                fen = board.fen().split(' ')
                state_two = fen[0] + ' ' + fen[1]

                res = self.cur.execute("SELECT * FROM moves WHERE move=? AND next=?",
                                       (state_one, state_two))
                res = res.fetchall()

                if len(res) != 0:
                    self.cur.execute("UPDATE moves SET count=count+1 WHERE move=? AND next=?",
                                     (state_one, state_two))
                else:
                    batch.append((state_one, state_two))

                state_one = state_two

            self.cur.executemany("INSERT INTO moves(move, next) VALUES"
                                 "(?, ?)", batch)
            count += 1
            print('\r' "%d games added to the database.." % count, end='')
            game = chess.pgn.read_game(games_file)

    self.con.commit()
    print("\n Finished!")

The pair (move, next) is unique!

The problem is: I tested this method with a file containing approximately 4 million (move, next) pairs. It starts out well, inserting/updating 3000 rows/s, but as the table gets larger, say 50K rows, it slows down to a rate of 100 rows/s and keeps dropping. Keep in mind that I designed this method to process multiple game files, which is why I chose to work with an SQL database in the first place.

It's not the INSERTs that are slow here.

Your move and next columns aren't indexed, so any SELECT or UPDATE involving those columns requires a full table scan.

If (move, next) is always unique, you'll want to add a UNIQUE index on that pair. It will also automagically make queries that filter on (move, next) pairs faster (but not necessarily queries that filter on only one of those two columns).

To create that index on your existing table:

CREATE UNIQUE INDEX ix_move_next ON moves (move, next);
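You can see the effect with EXPLAIN QUERY PLAN; a minimal in-memory sketch (the exact plan wording varies between SQLite versions, but the lookup goes from a table scan to an index search):

```python
import sqlite3

# Demonstrate how the UNIQUE index changes SQLite's plan for the lookup
# from a full table scan to a search on the index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE moves(move TEXT, next TEXT, count INTEGER DEFAULT 1)")

query = "SELECT * FROM moves WHERE move=? AND next=?"

# The last column of an EXPLAIN QUERY PLAN row is the human-readable detail.
before = con.execute("EXPLAIN QUERY PLAN " + query, ("a", "b")).fetchone()[-1]
print(before)  # e.g. "SCAN moves" (wording varies by version)

con.execute("CREATE UNIQUE INDEX ix_move_next ON moves (move, next)")
after = con.execute("EXPLAIN QUERY PLAN " + query, ("a", "b")).fetchone()[-1]
print(after)   # e.g. "SEARCH moves USING INDEX ix_move_next (move=? AND next=?)"
```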

Finally, once you have that index in place, you can get rid of the whole SELECT/UPDATE dance too with an upsert:

INSERT INTO moves (move, next) VALUES (?, ?) ON CONFLICT (move, next) DO UPDATE SET count = count + 1;
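As a quick sanity check of the upsert semantics (this requires SQLite 3.24+ for the ON CONFLICT clause, and the UNIQUE index must exist for the conflict target to be valid): the first insert of a pair gets the DEFAULT count of 1, and every repeat bumps the counter.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE moves(move TEXT, next TEXT, count INTEGER DEFAULT 1)")
# The ON CONFLICT target below needs this UNIQUE index to exist.
con.execute("CREATE UNIQUE INDEX ix_move_next ON moves (move, next)")

upsert = ("INSERT INTO moves (move, next) VALUES (?, ?) "
          "ON CONFLICT (move, next) DO UPDATE SET count = count + 1")

# Insert the same (move, next) pair three times.
for _ in range(3):
    con.execute(upsert, ("start", "e4"))

count = con.execute("SELECT count FROM moves").fetchone()[0]
print(count)  # 3
```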

Here's a slight refactoring that achieves about 6200 moves/second inserted on my machine. (It requires the tqdm library for a nice progress bar, and a pgns/ directory with PGN files.)

import glob
import sqlite3
import chess.pgn
import tqdm
from chess import WHITE


def board_to_state(board):
    # These were extracted from the implementation of `board.fen()`
    # so as to avoid doing extra work we don't need.
    bfen = board.board_fen(promoted=False)
    turn = ("w" if board.turn == WHITE else "b")
    return f'{bfen} {turn}'


def insert_game(cur, game):
    batch = []
    board = game.board()
    state_one = board_to_state(board)
    for move in game.mainline_moves():
        board.push(move)
        state_two = board_to_state(board)
        batch.append((state_one, state_two))
        state_one = state_two
    cur.executemany(
        "INSERT INTO moves (move, next) VALUES (?, ?) "
        "ON CONFLICT (move, next) DO UPDATE SET count = count + 1",
        batch,
    )
    n_moves = len(batch)
    return n_moves


def main():
    con = sqlite3.connect("moves.db")
    con.execute('PRAGMA temp_store = MEMORY')
    con.execute('PRAGMA synchronous = NORMAL')
    con.execute('PRAGMA journal_mode = WAL')
    con.execute('CREATE TABLE IF NOT EXISTS moves(move TEXT,next TEXT,count INTEGER DEFAULT 1);')
    con.execute('CREATE UNIQUE INDEX IF NOT EXISTS ix_move_next ON moves (move, next);')

    cur = con.cursor()

    for pgn_file in sorted(glob.glob("pgns/*.pgn")):
        with open(pgn_file) as games_file:
            n_games = 0
            with tqdm.tqdm(desc=pgn_file, unit="moves") as pbar:
                while (game := chess.pgn.read_game(games_file)):
                    n_moves = insert_game(cur, game)
                    n_games += 1
                    pbar.set_description(f"{pgn_file} ({n_games} games)", refresh=False)
                    pbar.update(n_moves)
            con.commit()


if __name__ == '__main__':
    main()
