
Multiprocessing with sqlalchemy

I have a python script that handles data transactions through sqlalchemy using:


def save_update(args):
    session, engine = create_session(config["DATABASE"])

    try:
        instance = get_record(session)
        if instance is None:
            instance = create_record(session)
        else:
            instance = update_record(session, instance)

        sync_errors(session, instance)
        sync_expressions(session, instance)
        sync_part(session, instance)

        session.commit()
    except Exception:
        session.rollback()
        write_error(config)
        raise
    finally:
        session.close()
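
(For context: create_session is not defined anywhere in the post. A minimal sketch of what such a helper might look like, assuming it builds one fresh engine and one session per call from the configured URI — all of this is hypothetical, only the names used above are from the post:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def create_session(db_uri):
    # Hypothetical sketch: the real helper is not shown in the post.
    # One new engine and one new session per call.
    engine = create_engine(db_uri)
    Session = sessionmaker(bind=engine)
    return Session(), engine

Creating a new engine on every call is safe across processes, but it also means there is no shared connection pool between invocations.)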

On top of the data transactions, I also have some processing unrelated to the database: data preparation that has to happen before I can do the data transaction. Those prerequisite tasks take some time, so I wanted to execute multiple instances of this full script (data preparation + data transactions with sqlalchemy) in parallel.

I am thus doing the following in a different script (simplified example here):

from threading import Thread

process1 = Thread(target=call_script, args=[["python", python_file_path,
     "-xml", xml_path,
     "-o", args.outputFolder,
     "-l", log_path]])

process2 = Thread(target=call_script, args=[["python", python_file_path,
     "-xml", xml_path,
     "-o", args.outputFolder,
     "-l", log_path]])

process1.start()
process2.start()
process1.join()
process2.join()

The target function "call_script" executes the first script mentioned above (data preparation + data transactions with sqlalchemy):

import subprocess

def call_script(args):
    status = subprocess.call(args, shell=True)
    print(status)
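
(A side note on that call: combining a list of arguments with shell=True is platform-dependent. On POSIX, only the first element, "python", is run through /bin/sh; the remaining items become arguments to the shell itself, not to the script. A minimal variant with shell=False, keeping the same argument list, would be:

import subprocess

def call_script(args):
    # args is a list like ["python", script_path, "-xml", xml_path, ...];
    # with shell=False the list is executed directly, no intermediate shell.
    status = subprocess.call(args, shell=False)
    print(status)

This foreshadows the resolution in the self-answer further down.)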

So, to summarize, I will for instance have 2 sub-threads plus the main one running. Each of those sub-threads executes the sqlalchemy code in a separate process.

My question thus is: should I be taking any specific precautions regarding the multiprocessing side of my code with sqlalchemy? To me the answer is no, as this is multiprocessing and not multithreading, precisely because I use subprocess.call() to execute my code.

Now, in reality, I occasionally seem to run into database locks during execution. I am not sure whether this is related to my code or to someone else hitting the database while I am processing it, but I was expecting each subprocess to lock the database when starting its work, so that the other subprocesses would wait for the current session to close.

EDIT

For testing, I have replaced the multithreading with multiprocessing:

    processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]

I still have the same issue, on which I now have more details: SQL Server shows the session status "AWAITING COMMAND", and it only goes away when I kill the related python process executing the command. It seems to appear when I parallelize the subprocesses intensively, but I am really not sure.
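
(The processes are then presumably waited on with something like the following — the exact code is not shown here, but p.wait() is mentioned in the resolution below:

for p in processes:
    p.wait()

)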

Thanks in advance for any support.

This is an interesting situation. It seems that you can maybe sidestep some of the manual process/thread handling and use something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. It delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned in the SQLAlchemy docs: engine-disposal

from random import randint
from datetime import datetime
from multiprocessing import Pool

from sqlalchemy import (
    create_engine,
    Integer,
    DateTime,
    String,
)
from sqlalchemy.schema import (
    Column,
    MetaData,
    ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref

db_uri = 'postgresql+psycopg2://username:password@/database'

engine = create_engine(db_uri, echo=False)

metadata = MetaData()

Base = declarative_base(metadata=metadata)

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True, index=True)
    created_on = Column(DateTime, nullable=False, index=True)
    device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
    device = relationship('Device', backref=backref("events"))


class Device(Base):
    __tablename__ = "devices"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(50))


def get_test_data(device_num):
    """ Generate a test device and its test events for the given device number. """
    device_dict = dict(name=f'device-{device_num}')
    event_dicts = []
    for day in range(1, 5):
        for hour in range(0, 24):
            for _ in range(0, randint(0, 50)):
                event_dicts.append({
                    "created_on": datetime(day=day, month=1, year=2022, hour=hour),
                })
    return (device_dict, event_dicts)


def create_test_data(device_num):
    """ Actually write the test data to the database. """
    device_dict, event_dicts = get_test_data(device_num)
    print(f"creating test data for {device_dict['name']}")

    with Session(engine) as session:
        device = Device(**device_dict)
        session.add(device)
        session.flush()
        events = [Event(**event_dict) for event_dict in event_dicts]
        event_count = len(events)
        device.events.extend(events)
        session.add_all(events)
        session.commit()
    return event_count


if __name__ == '__main__':

    metadata.create_all(engine)

    # Throw this away before fork.
    engine.dispose()

    # I have a 4-core processor, so I chose 3.
    with Pool(3) as p:
        print(p.map(create_test_data, range(0, 10)))

    # Accessing engine here should still work
    # but a new connection will be created.
    with Session(engine) as session:
        print(session.query(Event).count())


Output


creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596
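
A variation on the same idea (not from the original answer, just a sketch, reusing db_uri from the example above): instead of disposing the parent's engine, give each worker its own engine via a Pool initializer, so that no pooled connection ever crosses fork():

# Hypothetical alternative: one engine per worker process.
from multiprocessing import Pool
from sqlalchemy import create_engine

worker_engine = None  # set in each child by init_worker

def init_worker(uri):
    # Runs once per child process; the engine and its connection
    # pool are created after the fork, so nothing is shared.
    global worker_engine
    worker_engine = create_engine(uri)

def task(_):
    # Each task uses the process-local engine created in init_worker.
    with worker_engine.connect() as conn:
        pass  # run queries with conn here

if __name__ == '__main__':
    with Pool(3, initializer=init_worker, initargs=(db_uri,)) as p:
        p.map(task, range(10))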

I am answering my own question, as in the end it did not relate to SQLAlchemy at all. When executing:

processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]

On a specific batch, and for no obvious reason, one of the subprocesses was not exiting properly even though the script it called was reaching its end. I searched and found that this is an issue with using p.wait() when Popen has shell=True.

I set shell=False, used pipes for stdout and stderr, and also added a sys.exit(0) at the end of the python script being executed by the subprocess, to make sure it terminates the execution properly.
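
A minimal sketch of that fix, reusing the argument names from the question's first snippet (the exact command list is not shown in the post):

import subprocess

cmd = ["python", python_file_path, "-xml", xml_path, "-o", args.outputFolder, "-l", log_path]

# shell=False executes the list directly, with no intermediate shell,
# so waiting targets the python process itself.
p = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()  # drains the pipes and waits for exit
print(p.returncode)

# ...and at the very end of the called script itself: sys.exit(0)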

Hope it can help someone else. Thanks Ian for your support.
