
Fast loading and querying data in Python

I am doing some data analysis in Python. I have ~15k financial products identified by ISIN code and ~15 columns of daily data for each of them. I would like to easily and quickly access the data given an ISIN code.

The data is in a MySQL DB. On the Python side so far I have been working with Pandas DataFrame.

The first thing I did was to use pd.read_sql to load the DF directly from the database. However, this is relatively slow. Then I tried loading the full database into a single DF and serializing it to a pickle file. Loading the pickle file is fast, a few seconds. However, when querying for an individual product, the performance is the same as when querying the SQL DB. Here is some code:

import pandas as pd
from sqlalchemy import create_engine, engine
from src.Database import Database
import time
import src.bonds.database.BondDynamicDataETL as BondsETL

database_instance = Database(Database.get_db_instance_risk_analytics_prod())

engine = create_engine(
    "mysql+pymysql://"
    + database_instance.get_db_user()
    + ":"
    + database_instance.get_db_pass()
    + "@"
    + database_instance.get_db_host()
    + "/"
    + database_instance.get_db_name()
)
con = engine.connect()


class DataBase:
    def __init__(self):
        print("made a DataBase instance")

    def get_individual_bond_dynamic_data(self, isin):
        return self.get_individual_bond_dynamic_data_from_db(isin, con)

    @staticmethod
    def get_individual_bond_dynamic_data_from_db(isin, connection):
        df = pd.read_sql(
            "SELECT * FROM BondDynamicDataClean WHERE isin = '"
            + isin
            + "' ORDER BY date ASC",
            con=connection,
        )
        df.set_index("date", inplace=True)
        return df


class PickleFile:
    def __init__(self):
        print("made a PickleFile instance")
        df = pd.read_pickle("bonds_pickle.pickle")
        # df.set_index(['isin', 'date'], inplace=True)
        self.data = df
        print("loaded file")

    def get_individual_bond_dynamic_data(self, isin):
        # Note: @isin must not be quoted, or query() compares against the
        # literal string "@isin" instead of the variable's value.
        result = self.data.query("isin == @isin")
        return result


fromPickle = PickleFile()
fromDB = DataBase()

isins = BondsETL.get_all_isins_with_dynamic_data_from_db(
    connection=con,
    table_name=database_instance.get_bonds_dynamic_data_clean_table_name(),
)

isins = isins[0:50]

start_pickle = time.time()

for i, isin in enumerate(isins):
    x = fromPickle.get_individual_bond_dynamic_data(isin)
    print("pickle: " + str(i))

stop_pickle = time.time()

for i, isin in enumerate(isins):
    x = fromDB.get_individual_bond_dynamic_data(isin)
    print("db: " + str(i))

stop_db = time.time()

pickle_t = stop_pickle - start_pickle
db_t = stop_db - stop_pickle
print("pickle: " + str(pickle_t))
print("db: " + str(db_t))
print("ratio: " + str(pickle_t / db_t))

This results in:

pickle: 7.636280059814453
db: 6.167926073074341
ratio: 1.23806283819615

Also, curiously enough, setting the index on the DF (uncommenting the line in the constructor) slows everything down!

I thought of trying https://www.pytables.org/index.html as an alternative to Pandas. Any other ideas or comments?

Greetings, Georgi

So, collating some thoughts from the comments:

  • Use mysqlclient instead of PyMySQL if you want more speed on the SQL side of the fence.
  • Ensure the columns you're querying on are indexed in your SQL table (isin for filtering and date for ordering).
  • You can set index_col="date" directly in read_sql() according to the docs; it might be faster.
  • I'm no Pandas expert, but I think boolean indexing, self.data[self.data["isin"] == isin], would be more performant than self.data.query(...). (Brackets are needed here: attribute access self.data.isin collides with the DataFrame.isin method, so it doesn't return the column.)
  • If you don't need to query things cross-isin and want to use pickles, you could store the data for each isin in a separate pickle file.
  • Also, for the sake of Little Bobby Tables, the patron saint of SQL injection attacks, use parameters in your SQL statements instead of concatenating strings.
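A minimal sketch of the parameterized-query pattern from the last bullet. An in-memory SQLite table with made-up bond rows stands in for the real MySQL BondDynamicDataClean table; the binding pattern is the same either way:

```python
import sqlite3

import pandas as pd

# Made-up demo data standing in for the MySQL table.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE BondDynamicDataClean (isin TEXT, date TEXT, price REAL);
    INSERT INTO BondDynamicDataClean VALUES
        ('XS0000000001', '2020-01-02', 101.5),
        ('XS0000000001', '2020-01-01', 101.0),
        ('XS0000000002', '2020-01-01',  99.0);
    """
)

def get_individual_bond_dynamic_data_from_db(isin, connection):
    # A placeholder plus a params tuple lets the driver escape the value,
    # closing the injection hole; index_col replaces the later set_index() call.
    return pd.read_sql(
        "SELECT * FROM BondDynamicDataClean WHERE isin = ? ORDER BY date ASC",
        con=connection,
        params=(isin,),
        index_col="date",
    )

df = get_individual_bond_dynamic_data_from_db("XS0000000001", con)
print(df)
```

With PyMySQL or mysqlclient the placeholder syntax is %(isin)s with a dict, e.g. params={"isin": isin}, rather than ? with a tuple.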

It helped a lot to transform the large data frame into a dictionary {isin -> DF} of smaller data frames keyed by ISIN code. Data retrieval from a dictionary is much more efficient than from a DF, and it is very natural to request a single DF given an ISIN code. Hope this helps someone else.
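The dictionary approach can be sketched like this (the column names, ISINs, and values are made up): one groupby pass splits the big DF once, and every later lookup is a single hash access instead of a full-frame scan.

```python
import pandas as pd

# Made-up frame with the same shape as the big bonds DF: one row per (isin, date).
big = pd.DataFrame(
    {
        "isin": ["XS0001", "XS0001", "XS0002"],
        "date": ["2020-01-01", "2020-01-02", "2020-01-01"],
        "price": [101.0, 101.5, 99.0],
    }
)

# One pass over the data builds the {isin -> per-bond DF} dictionary.
bonds = {
    isin: frame.set_index("date").drop(columns="isin")
    for isin, frame in big.groupby("isin")
}

def get_individual_bond_dynamic_data(isin):
    # Constant-time dict lookup instead of scanning the whole DF.
    return bonds[isin]

print(get_individual_bond_dynamic_data("XS0001"))
```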
