
Python: Using sqlite3 with multiprocessing

I have a SQLite3 DB. I need to parse 10000 files. I read some data from each file, and then query the DB with this data to get a result. My code works fine in a single-process environment, but I get an error when trying to use the multiprocessing Pool.

My approach without multiprocessing (works OK):
1. Open DB connection object
2. for f in files: 
     foo(f, x1=x1, x2=x2, ..., db=DB)
3. Close DB

My approach with multiprocessing (does NOT work):
1. Open DB
2. pool = multiprocessing.Pool(processes=4) 
3. pool.map(functools.partial(foo, x1=x1, x2=x2, ..., db=DB), [files])
4. pool.close()
5. Close DB 

I get the following error: sqlite3.ProgrammingError: Base Cursor.__init__ not called.
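The underlying issue (my reading of it, not stated in the traceback itself): sqlite3 connection and cursor objects are bound to the process that created them and cannot be serialized, yet pool.map must pickle every argument it sends to a worker, including the db object captured by functools.partial. A minimal sketch showing that a connection refuses to pickle:

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")

# pool.map would need to pickle this object to ship it to a worker;
# sqlite3 connections do not support pickling.
try:
    pickle.dumps(conn)
except TypeError as exc:
    print("cannot pickle connection:", exc)

conn.close()
```

This is why the MapDB instance (which holds a live connection and cursor) cannot simply be passed into the pool.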

My DB class is implemented as follows:

import logging
import sqlite3
import sys

log = logging.getLogger(__name__)

def open_db(sqlite_file):
    """Open SQLite database connection.

    Args:
    sqlite_file -- File path

    Return:
    Connection
    """

    log.info('Open SQLite database %s', sqlite_file)
    try:
        conn = sqlite3.connect(sqlite_file)
    except sqlite3.Error as e:  # "except sqlite3.Error, e:" is Python 2-only syntax
        log.error('Unable to open SQLite database %s', e.args[0])
        sys.exit(1)

    return conn

def close_db(conn, sqlite_file):
    """Close SQLite database connection.

    Args:
    conn -- Connection
    sqlite_file -- File path (used for logging)
    """

    if conn:
        log.info('Close SQLite database %s', sqlite_file)
        conn.close()

class MapDB:

    def __init__(self, sqlite_file):
        """Initialize.

        Args:
        sqlite_file -- File path
        """

        # 1. Open database.
        # 2. Setup to receive data as dict().
        # 3. Get cursor to execute queries.
        self._sqlite_file      = sqlite_file
        self._conn             = open_db(sqlite_file)
        self._conn.row_factory = sqlite3.Row
        self._cursor           = self._conn.cursor()

    def close(self):
        """Close DB connection."""

        if self._cursor:
            self._cursor.close()
        close_db(self._conn, self._sqlite_file)

    def check(self):
        ...

    def get_driver_net(self, net):
        ...

    def get_cell_id(self, net):
       ...

Function foo() looks like this:

def foo(f, x1, x2, db):
    # ... extract some data from file f ...
    r1 = db.get_driver_net(...)
    r2 = db.get_cell_id(...)

The overall (non-working) implementation is as follows:

mapdb = MapDB(sqlite_file)

log.info('Create NetInfo objects')
pool = multiprocessing.Pool(processes=4)
files = [get list of files to process]                 
pool.map(functools.partial(foo, x1=x1, x2=x2, db=mapdb), files)    
pool.close()
mapdb.close()

To fix this, I think I need to create the MapDB() object inside each pool worker (so there are 4 parallel, independent connections), but I'm not sure how to do this. Can someone show me an example of how to accomplish this with Pool?

What about defining foo like this:

def foo(f, x1, x2, db_path):
    mapdb = MapDB(db_path)   # opens its own connection inside this worker
    # ... process data ...
    mapdb.close()

and then changing your pool.map call to:

pool.map(functools.partial(foo, x1=x1, x2=x2, db_path="path-to-sqlite3-db"), files)    
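If opening and closing the database once per file is too costly for 10000 files, a variant of the same idea uses Pool's initializer argument so each worker process opens one connection at startup and reuses it. This is a sketch under assumptions of mine: the table, column names, and helpers (init_worker, run_demo, the cells table) are invented for illustration, not taken from the question.

```python
import multiprocessing
import os
import sqlite3
import tempfile

# Per-process connection, created once by the Pool initializer.
_conn = None

def init_worker(db_path):
    """Runs once in each worker process: open a private connection."""
    global _conn
    _conn = sqlite3.connect(db_path)
    _conn.row_factory = sqlite3.Row

def foo(name):
    """Query the per-process connection; stands in for the question's foo()."""
    cur = _conn.execute("SELECT id FROM cells WHERE name = ?", (name,))
    row = cur.fetchone()
    return (name, row["id"] if row else None)

def run_demo():
    """Build a throwaway database, then process the work list with 4 workers."""
    fd, db_path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE cells (name TEXT, id INTEGER)")
    conn.executemany("INSERT INTO cells VALUES (?, ?)",
                     [("a", 1), ("b", 2), ("c", 3)])
    conn.commit()
    conn.close()

    names = ["a", "b", "c", "missing"]
    with multiprocessing.Pool(processes=4,
                              initializer=init_worker,
                              initargs=(db_path,)) as pool:
        results = pool.map(foo, names)  # only the names are pickled, never a connection
    os.unlink(db_path)
    return results

if __name__ == "__main__":
    print(run_demo())
```

Only the db path (a plain string) crosses the process boundary, so nothing unpicklable is ever handed to pool.map.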

Update

Another option is to handle the worker threads yourself and distribute work via a Queue.

from queue import Queue        # "from Queue import Queue" on Python 2
from threading import Thread

q = Queue()

def worker():
    mapdb = ...  # open the sqlite database
    while True:
        item = q.get()
        if item[0] == "file":
            file = item[1]
            # ... process file ...
            q.task_done()
        else:
            q.task_done()
            break
    # ... close sqlite connection ...

# Start up the workers

nworkers = 4

for i in range(nworkers):
    t = Thread(target=worker)  # do not reuse the name "worker" here, or the
    t.daemon = True            # second iteration would target a Thread object
    t.start()

# Place work on the Queue

for x in ...:  # ... list of files ...
    q.put(("file", x))

# Place termination tokens onto the Queue

for i in range(nworkers):
    q.put(("end",))

# Wait for all work to be done.

q.join()

The termination tokens are used to ensure that the sqlite connections are closed - in case that matters.
