
Is there a way I can use multi-threading or multi-processing in Python to connect to 200 different servers and download data from them?

I am writing a script in Python to download data from around 200 different servers using multi-threading. My objective is to fetch data from a table in each server's database and save it to a CSV file. All of the servers have the same database and table.

The code I have written is:

import concurrent.futures
import sqlalchemy as db
import urllib
import pandas as pd


def write_to_database(line):
    try:
        server = line[0]
        filename = line[1]
        file_format = ".csv"
        file = filename + file_format
        print(file)
        params = urllib.parse.quote_plus(
            "DRIVER={SQL Server};SERVER=" + server + ";DATABASE=Database_name;UID=xxxxxxx;PWD=xxxxxxx")
        engine = db.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        sql_DF = pd.read_sql("SELECT * FROM table_name",
                             con=engine, chunksize=50000)
        sql_DF.to_csv()


    except Exception as e:
        print(e)


def read_server_names():
    print("Reading Server Names")
    f = open("servers_data.txt", "r")
    contents = f.readlines()
    for line in contents:
        list.append(line.split(','))


def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for line in zip(list, executor.map(write_to_database, list)):
            print()


if __name__ == '__main__':
    list = []
    read_server_names()
    main()

The problem with this code is that the process uses a lot of system memory. Can I get some guidance on a better way to do this task using either multi-threading or multi-processing? Which will give good performance while using fewer system resources?

I'd suggest using multiprocessing. I've also slightly refactored your reading code to avoid using a global variable.

  • The write function now prints status messages so you can follow each server's progress: one when it begins reading from a given server and another when it has finished writing its CSV file.
  • Concurrency is limited to 10 tasks, and each worker process is recycled after 100. You may want to change those parameters.
  • imap_unordered is used for slightly faster performance, since the order of tasks doesn't matter here.

If this is still too resource-intensive, you will need to do something other than naively using pandas; instead, use SQLAlchemy to run the same query and write the rows to the CSV file one at a time.

import multiprocessing
import sqlalchemy as db
import urllib
import pandas as pd

file_format = ".csv"


def write_to_database(line):
    try:
        server, filename = line
        file = filename + file_format

        params = urllib.parse.quote_plus(
            "DRIVER={SQL Server};"
            "SERVER=" + server + ";"
            "DATABASE=Database_name;"
            "UID=xxxxxxx;"
            "PWD=xxxxxxx"
        )
        print(server, "start")
        engine = db.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        # With chunksize, read_sql returns an iterator of DataFrames, so
        # stream each chunk to disk instead of holding the full result in memory.
        rows = 0
        for i, chunk in enumerate(pd.read_sql("SELECT * FROM table_name", con=engine, chunksize=50000)):
            chunk.to_csv(file, mode="a" if i else "w", header=(i == 0), index=False)
            rows += len(chunk)
        print(server, "write", file, rows, "rows")
    except Exception as e:
        print(e)


def read_server_names():
    with open("servers_data.txt", "r") as f:
        for line in f:
            # Will break (on purpose) if there are more than 2 fields
            server, filename = line.strip().split(",")
            yield (server, filename)


def main():
    server_names = list(read_server_names())
    # 10 requests (subprocesses) at a time, recycle every 100 servers
    with multiprocessing.Pool(processes=10, maxtasksperchild=100) as p:
        for i, _ in enumerate(p.imap_unordered(write_to_database, server_names), 1):
            # The function handles its own output; just report progress here.
            print("Progress:", i, "/", len(server_names))


if __name__ == "__main__":
    main()
