
python multiprocessing + peewee + postgresql fails with SSL error

I am trying to write a Python program that does some processing in a PostgreSQL database using the multiprocessing module and peewee.

In single-core mode the code works, but when I try to run it on multiple cores I run into an SSL error.

I would like to post the structure of my program in the hope that somebody can advise how to set it up properly. Currently I have chosen an object-oriented approach in which I make one connection that is shared in a pool. To clarify what I have done, I will now show the source code I have so far.

I have three files: main.py, models.py and parser.py. The contents are as follows.

models.py defines the peewee PostgreSQL tables and connects to the postgres server:

import peewee as pw
from playhouse.pool import PooledPostgresqlExtDatabase

KVK_KEY = "id_number"
NAME_KEY = "name"
N_VOWELS_KEY = "n_vowels"

# initialise the database
database = PooledPostgresqlExtDatabase(
    "testdb", user="postgres", host="localhost", port=5432, password="xxxx",
    max_connections=8, stale_timeout=300 )


class BaseModel(pw.Model):
    class Meta:
        database = database
        only_save_dirty = True


# this class describes the format of the sql database table
class Company(BaseModel):
    id_number = pw.IntegerField(primary_key=True)
    name = pw.CharField(null=True)
    n_vowels = pw.IntegerField(default=-1)
    processor = pw.IntegerField(default=-1)


def connect_database(database_name, reset_database=False):
    """ connect the database """
    database.connect()
    if reset_database:
        database.drop_tables([Company])
    database.create_tables([Company])

parser.py contains the CompanyParser class, which serves as the engine of the code and does all the processing. It generates some artificial data that is stored in the PostgreSQL database, and the run method then does some processing on the data that was already stored in the database:

import pandas as pd
import numpy as np
import random
import string
import peewee as pw
from models import (Company, database, KVK_KEY, NAME_KEY)
import multiprocessing as mp

MAX_SQL_CHUNK = 1000

np.random.seed(0)


def random_name(size=8, chars=string.ascii_lowercase):
    """ Create a random character string of 'size' characters """
    return "".join(random.choice(chars) for _ in range(size))


def vowel_count(characters):
    """
    Count the number of vowels in the string 'characters' and return as an integer
    """
    count = 0
    for char in characters:
        if char in list("aeiou"):
            count += 1
    return count


class CompanyParser(mp.Process):
    def __init__(self, number_of_companies=100, i_proc=None,
                 number_of_procs=1,
                 first_id=None, last_id=None):
        if i_proc is not None and number_of_procs > 1:
            mp.Process.__init__(self)

        self.i_proc = i_proc
        self.number_of_procs = number_of_procs
        self.n_companies = number_of_companies
        self.data_df: pd.DataFrame = None

        self.first_id = first_id
        self.last_id = last_id

    def generate_data(self):
        """ Create a dataframe with fake company data and id's """
        id_list = np.random.randint(1000000, 9999999, self.n_companies)
        company_list = np.array([random_name() for _ in range(self.n_companies)])
        self.data_df = pd.DataFrame(data=np.vstack([id_list, company_list]).T,
                                    columns=[KVK_KEY, NAME_KEY])
        self.data_df.sort_values([KVK_KEY], inplace=True)

    def store_to_database(self):
        """
        Store the company data to a sql database
        """
        record_list = list(self.data_df.to_dict(orient="index").values())

        n_batch = int(len(record_list) / MAX_SQL_CHUNK) + 1

        with database.atomic():
            for cnt, batch in enumerate(pw.chunked(record_list, MAX_SQL_CHUNK)):
                print(f"writing {cnt}/{n_batch}")
                Company.insert_many(batch).execute()

    def run(self):
        print("Making query at {}".format(self.i_proc))
        query = (Company.
                 select().
                 where(Company.id_number.between(self.first_id, self.last_id)))
        print("Found {} companies".format(query.count()))

        for cnt, company in enumerate(query):
            print("Processing @ {} - {}:  company {}/{}".format(self.i_proc, cnt,
                                                                company.id_number,
                                                                company.name))
            number_of_vowels = vowel_count(company.name)
            company.n_vowels = number_of_vowels
            company.processor = self.i_proc
            print(f"storing number of vowels: {number_of_vowels}")
            company.save()

Finally, my main script loads the classes defined in models.py and parser.py and starts the code:

from models import (Company, connect_database)
from parser import CompanyParser

number_of_processors = 2
connect_database(None, reset_database=True)

# init an object of the CompanyParser and use it to create the database content
parser = CompanyParser()

company_ids = Company.select(Company.id_number)
parser.generate_data()
parser.store_to_database()

n_companies = company_ids.count()
n_comp_per_proc = int(n_companies / number_of_processors)
print("Found {} companies: {} per proc".format(n_companies, n_comp_per_proc))

for i_proc in range(number_of_processors):
    i_start = i_proc * n_comp_per_proc
    first_id = company_ids[i_start]
    last_id = company_ids[i_start + n_comp_per_proc - 1]

    print(f"Running proc {i_proc} for id {first_id} until id {last_id}")
    sub_parser = CompanyParser(first_id=first_id, last_id=last_id,
                               i_proc=i_proc,
                               number_of_procs=number_of_processors)

    if number_of_processors > 1:
        sub_parser.start()
    else:
        sub_parser.run()

This script works fine if number_of_processors = 1. It generates the artificial data, stores it in the PostgreSQL database, and does some processing on the data (it counts the number of vowels in each name and stores the result in the n_vowels column).

However, if I try to run it on 2 cores with number_of_processors = 2, I run into the following error:

/opt/miniconda3/bin/python /home/eelco/PycharmProjects/multiproc_peewee/main.py
writing 0/1
Found 100 companies: 50 per proc
Running proc 0 for id 1020737 until id 5295565
Running proc 1 for id 5302405 until id 9891087
Making query at 0
Found 50 companies
Processing @ 0 - 0:  company 1020737/wqrbgxiu
storing number of vowels: 2
Making query at 1
Process CompanyParser-1:
Processing @ 0 - 1:  company 1086107/lkbagrbc
storing number of vowels: 1
Processing @ 0 - 2:  company 1298367/nsdjsqio
storing number of vowels: 2
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2714, in execute_sql
    cursor.execute(sql, params or ())
psycopg2.OperationalError: SSL error: sslv3 alert bad record mac


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/eelco/PycharmProjects/multiproc_peewee/parser.py", line 82, in run
    company.save()
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 5748, in save
    rows = self.update(**field_dict).where(self._pk_expr()).execute()
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1625, in inner
    return method(self, database, *args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1696, in execute
    return self._execute(database)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2121, in _execute
    cursor = database.execute(self)
  File "/opt/miniconda3/lib/python3.7/site-packages/playhouse/postgres_ext.py", line 468, in execute
    cursor = self.execute_sql(sql, params, commit=commit)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2721, in execute_sql
    self.commit()
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2512, in __exit__
    reraise(new_type, new_type(*exc_args), traceback)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2714, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: SSL error: sslv3 alert bad record mac

Process CompanyParser-2:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2714, in execute_sql
    cursor.execute(sql, params or ())
psycopg2.OperationalError: SSL error: decryption failed or bad record mac


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/eelco/PycharmProjects/multiproc_peewee/parser.py", line 72, in run
    print("Found {} companies".format(query.count()))
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1625, in inner
    return method(self, database, *args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1881, in count
    return Select([clone], [fn.COUNT(SQL('1'))]).scalar(database)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1625, in inner
    return method(self, database, *args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1866, in scalar
    row = self.tuples().peek(database)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1625, in inner
    return method(self, database, *args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1853, in peek
    rows = self.execute(database)[:n]
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1625, in inner
    return method(self, database, *args, **kwargs)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1696, in execute
    return self._execute(database)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 1847, in _execute
    cursor = database.execute(self)
  File "/opt/miniconda3/lib/python3.7/site-packages/playhouse/postgres_ext.py", line 468, in execute
    cursor = self.execute_sql(sql, params, commit=commit)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2721, in execute_sql
    self.commit()
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2512, in __exit__
    reraise(new_type, new_type(*exc_args), traceback)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/opt/miniconda3/lib/python3.7/site-packages/peewee.py", line 2714, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: SSL error: decryption failed or bad record mac


Process finished with exit code 0

Somehow something goes wrong as soon as the second process starts doing something with the database. Does anybody have advice on getting this code to work? I have already tried the following:

  • Using PooledPostgresDatabase and the plain PostgresqlDatabase to connect to the database. This leads to the same error.
  • Using sqlite instead of postgres. This works for 2 cores, but only if the two processes do not interfere too much; otherwise I run into some locking problems. I was under the impression that postgres would be better suited for multiprocessing than sqlite (is that true?).
  • Putting a pause after starting the first process (so effectively using only one core) makes the code work, showing that the start method is called correctly.

Hopefully somebody can advise.

Regards, Eelco

After some searching on the internet today I found the solution to my problem here: github.com/coleifer. As coleifer mentions: you apparently first have to set up all the forks before you start connecting to the database. Based on this idea I have modified my code and it is now working.
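
The essence of the fix can be summarised in a minimal sketch (this is just the pattern, not my actual code, and the connection parameters are placeholders): every connection object is created inside the child process, after the fork, rather than being created in the parent and inherited by the children.

import multiprocessing as mp

from playhouse.pool import PooledPostgresqlDatabase


def work(first_id, last_id):
    # the connection is opened here, inside the child, after the fork
    db = PooledPostgresqlDatabase("testdb", user="postgres", host="localhost",
                                  port=5432, password="xxxx")
    db.connect()
    try:
        pass  # run the queries for this id range here
    finally:
        db.close()


if __name__ == "__main__":
    jobs = [mp.Process(target=work, args=(i * 50, (i + 1) * 50 - 1)) for i in range(2)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()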

For those interested I will post my Python scripts again so you can see how I did it. There are not that many explicit examples out there, so perhaps this helps somebody else.

First of all, all the database and peewee setup is now moved into initialisation functions, which are only called inside the constructor of the CompanyParser class. So models.py now looks like:

import peewee as pw
from playhouse.pool import PooledPostgresqlExtDatabase, PostgresqlDatabase, PooledPostgresqlDatabase

KVK_KEY = "id_number"
NAME_KEY = "name"
N_VOWELS_KEY = "n_vowels"


def init_database():
    db = PooledPostgresqlDatabase(
        "testdb", user="postgres", host="localhost", port=5432, password="xxxxx",
        max_connections=8, stale_timeout=300)
    return db


def init_models(db, reset_tables=False):

    class BaseModel(pw.Model):
        class Meta:
            database = db

    # this class describes the format of the sql database table
    class Company(BaseModel):
        id_number = pw.IntegerField(primary_key=True)
        name = pw.CharField(null=True)
        n_vowels = pw.IntegerField(default=-1)
        processor = pw.IntegerField(default=-1)

    if db.is_closed():
        db.connect()
    if reset_tables and Company.table_exists():
        db.drop_tables([Company])
    db.create_tables([Company])

    return Company

Then the worker class CompanyParser is defined in the parser.py script and looks like this:

import multiprocessing as mp
import random
import string

import numpy as np
import pandas as pd
import peewee as pw

from models import (KVK_KEY, NAME_KEY, init_database, init_models)

MAX_SQL_CHUNK = 1000

np.random.seed(0)


def random_name(size=32, chars=string.ascii_lowercase):
    """ Create a random character string of 'size' characters """
    return "".join(random.choice(chars) for _ in range(size))


def vowel_count(characters):
    """
    Count the number of vowels in the string 'characters' and return as an integer
    """
    count = 0
    for char in characters:
        if char in list("aeiou"):
            count += 1
    return count


class CompanyParser(mp.Process):
    def __init__(self, reset_tables=False,
                 number_of_companies=100, i_proc=None,
                 number_of_procs=1, first_id=None, last_id=None):
        if i_proc is not None and number_of_procs > 1:
            mp.Process.__init__(self)

        self.i_proc = i_proc
        self.reset_tables = reset_tables

        self.number_of_procs = number_of_procs
        self.n_companies = number_of_companies
        self.data_df: pd.DataFrame = None

        self.first_id = first_id
        self.last_id = last_id

        # initialise the database and models
        self.database = init_database()
        self.Company = init_models(self.database, reset_tables=self.reset_tables)

    def generate_data(self):
        """ Create a dataframe with fake company data and id's and return the array of id's"""
        id_list = np.random.randint(1000000, 9999999, self.n_companies)
        company_list = np.array([random_name() for _ in range(self.n_companies)])
        self.data_df = pd.DataFrame(data=np.vstack([id_list, company_list]).T,
                                    columns=[KVK_KEY, NAME_KEY])
        self.data_df.drop_duplicates([KVK_KEY], inplace=True)
        self.data_df.sort_values([KVK_KEY], inplace=True)
        return self.data_df[KVK_KEY].values

    def store_to_database(self):
        """
        Store the company data to a sql database
        """
        record_list = list(self.data_df.to_dict(orient="index").values())

        n_batch = int(len(record_list) / MAX_SQL_CHUNK) + 1

        with self.database.atomic():
            for cnt, batch in enumerate(pw.chunked(record_list, MAX_SQL_CHUNK)):
                print(f"writing {cnt}/{n_batch}")
                self.Company.insert_many(batch).execute()

    def run(self):
        query = (self.Company.
                 select().
                 where(self.Company.id_number.between(self.first_id, self.last_id)))

        for cnt, company in enumerate(query):
            print("Processing @ {} - {}:  company {}/{}".format(self.i_proc, cnt, company.id_number,
                                                                company.name))
            number_of_vowels = vowel_count(company.name)
            company.n_vowels = number_of_vowels
            company.processor = self.i_proc
            try:
                company.save()
            except (pw.OperationalError, pw.InterfaceError) as err:
                print("failed save for {} {}: {}".format(self.i_proc, cnt, err))
            else:
                pass

Finally, the main.py script which starts the processes:

from parser import CompanyParser
import time


def main():
    number_of_processors = 2
    number_of_companies = 10000

    parser = CompanyParser(number_of_companies=number_of_companies, reset_tables=True)
    company_ids = parser.generate_data()
    parser.store_to_database()

    n_companies = company_ids.size
    n_comp_per_proc = int(n_companies / number_of_processors)
    print("Found {} companies: {} per proc".format(n_companies, n_comp_per_proc))
    if not parser.database.is_closed():
        parser.database.close()

    processes = list()
    for i_proc in range(number_of_processors):
        i_start = i_proc * n_comp_per_proc
        first_id = company_ids[i_start]
        last_id = company_ids[i_start + n_comp_per_proc - 1]

        print(f"Running proc {i_proc} for id {first_id} until id {last_id}")

        sub_parser = CompanyParser(first_id=first_id, last_id=last_id, i_proc=i_proc,
                                   number_of_procs=number_of_processors)

        if number_of_processors > 1:
            sub_parser.start()
        else:
            sub_parser.run()

        processes.append(sub_parser)

    # this blocks the script until all processes are done
    for job in processes:
        job.join()

    # make sure all the connections are closed
    for i_proc in range(number_of_processors):
        db = processes[i_proc].database
        if not db.is_closed():
            db.close()
    print("Goodbye!")


if __name__ == "__main__":

    start = time.time()
    main()
    duration = time.time() - start
    print(f"Done in {duration} s")

As you can see, the database connection is made per process, inside the class. This example is a complete multiprocessing + peewee + PostgreSQL example; hopefully it helps somebody else. If you have any comments or suggestions for improvement, please let me know.
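
As a quick sanity check after a run, a small query (a sketch that reuses the init_database/init_models helpers from models.py above) can show how many companies each process handled via the processor column:

from peewee import fn

from models import init_database, init_models

db = init_database()
Company = init_models(db)

# count the rows that each worker processed
query = (Company
         .select(Company.processor, fn.COUNT(Company.id_number).alias("n_rows"))
         .group_by(Company.processor)
         .order_by(Company.processor))
for row in query:
    print(f"processor {row.processor}: {row.n_rows} companies")

db.close()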

I also ran into this error, but with flask + peewee + rq on Heroku. Below is how I solved it:

If you have a simple app that you use with RQ, I would suggest using SimpleWorker.

RQ suggests using rq.worker.HerokuWorker, but I still received an SSL error with it. The error appeared in a case where I had created follow-up (chained) tasks, where the execution of one depends on the success of another task.
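
A chained task of that kind is enqueued roughly like this (a sketch; my_tasks.store_company and my_tasks.count_vowels are hypothetical task functions):

from redis import Redis
from rq import Queue

import my_tasks  # hypothetical module containing the task functions

queue = Queue(connection=Redis())

# the second job only runs after the first one has finished successfully
first = queue.enqueue(my_tasks.store_company, company_id=123)
second = queue.enqueue(my_tasks.count_vowels, company_id=123, depends_on=first)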

Also, I am using flask-rq2, but this applies to normal use as well, as shown below:

# app.py
app = Flask(__name__)
app.config['RQ_WORKER_CLASS'] = os.getenv('RQ_WORKER_CLASS', 'rq.worker.Worker')
rq = RQ(app)

I solved it by changing the following in my heroku config (see the sketch after this list for what that worker class does differently):

  • Set your RQ_WORKER_CLASS to rq.worker.SimpleWorker
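
For reference, a minimal sketch (assuming a reachable Redis instance and the rq and redis packages) of running a SimpleWorker directly; unlike the default Worker it executes jobs in the same process instead of forking a work horse, which is why the shared-connection SSL problem does not come up:

from redis import Redis
from rq import Queue
from rq.worker import SimpleWorker

redis_conn = Redis()
queue = Queue(connection=redis_conn)

# SimpleWorker runs each job in the current process (no fork),
# so database connections are never shared across a fork boundary
worker = SimpleWorker([queue], connection=redis_conn)
worker.work()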
