I am doing some data analysis in Python. I have ~15k financial products identified by ISIN code and ~15 columns of daily data for each of them. I would like to easily and quickly access the data given an ISIN code.
The data is in a MySQL DB. On the Python side I have so far been working with Pandas DataFrames.
First thing I did was to use pd.read_sql to load the DF directly from the database. However, this is relatively slow. Then I tried loading the full table into a single DF and serializing it to a pickle file. Loading the pickle file is fast, a few seconds. However, when querying for an individual product, the performance is about the same as when I query the SQL DB directly. Here is some code:
import pandas as pd
from sqlalchemy import create_engine
from src.Database import Database
import time
import src.bonds.database.BondDynamicDataETL as BondsETL

database_instance = Database(Database.get_db_instance_risk_analytics_prod())

engine = create_engine(
    "mysql+pymysql://"
    + database_instance.get_db_user()
    + ":"
    + database_instance.get_db_pass()
    + "@"
    + database_instance.get_db_host()
    + "/"
    + database_instance.get_db_name()
)
con = engine.connect()


class DataBase:
    """Fetches one product at a time straight from MySQL."""

    def __init__(self):
        print("made a DataBase instance")

    def get_individual_bond_dynamic_data(self, isin):
        return self.get_individual_bond_dynamic_data_from_db(isin, con)

    @staticmethod
    def get_individual_bond_dynamic_data_from_db(isin, connection):
        df = pd.read_sql(
            "SELECT * FROM BondDynamicDataClean WHERE isin = '"
            + isin
            + "' ORDER BY date ASC",
            con=connection,
        )
        df.set_index("date", inplace=True)
        return df


class PickleFile:
    """Loads the whole table from a pickle once, then filters in memory."""

    def __init__(self):
        print("made a PickleFile instance")
        df = pd.read_pickle("bonds_pickle.pickle")
        # df.set_index(['isin', 'date'], inplace=True)
        self.data = df
        print("loaded file")

    def get_individual_bond_dynamic_data(self, isin):
        result = self.data.query("isin == @isin")
        return result


fromPickle = PickleFile()
fromDB = DataBase()

isins = BondsETL.get_all_isins_with_dynamic_data_from_db(
    connection=con,
    table_name=database_instance.get_bonds_dynamic_data_clean_table_name(),
)
isins = isins[0:50]

# Time 50 lookups against the in-memory pickle, then the same 50 against MySQL.
start_pickle = time.time()
for i, isin in enumerate(isins):
    x = fromPickle.get_individual_bond_dynamic_data(isin)
    print("pickle: " + str(i))
stop_pickle = time.time()

for i, isin in enumerate(isins):
    x = fromDB.get_individual_bond_dynamic_data(isin)
    print("db: " + str(i))
stop_db = time.time()

pickle_t = stop_pickle - start_pickle
db_t = stop_db - stop_pickle
print("pickle: " + str(pickle_t))
print("db: " + str(db_t))
print("ratio: " + str(pickle_t / db_t))
This results in:
pickle: 7.636280059814453
db: 6.167926073074341
ratio: 1.23806283819615
Also, curiously enough, setting the index on the DF (uncommenting the line in the PickleFile constructor) slows everything down!
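My guess is that an index only pays off if it is sorted and the lookup goes through .loc instead of .query; a rough, untested sketch of what I mean, reusing the same pickle file and column names as above:

import pandas as pd

df = pd.read_pickle("bonds_pickle.pickle")
# Sort the MultiIndex once up front; .loc can then slice out one ISIN
# without evaluating a boolean mask over the whole frame the way .query does.
df = df.set_index(["isin", "date"]).sort_index()

def get_individual_bond_dynamic_data(isin):
    # Partial indexing on the first level returns a DF indexed by date.
    return df.loc[isin]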
I thought of trying PyTables (https://www.pytables.org/index.html) as an alternative to Pandas. Any other ideas or comments?
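If I go the PyTables route, I would probably use it through pandas' HDFStore layer; an untested sketch (the file name and key are made up, the isin and date columns are as above, and the 'tables' package has to be installed):

import pandas as pd

# One-off: write the data in the queryable "table" format, with isin
# stored as a searchable data column.
df = pd.read_pickle("bonds_pickle.pickle")
df.to_hdf("bonds.h5", key="bonds", format="table", data_columns=["isin"])

def get_individual_bond_dynamic_data(isin):
    # Reads back only the rows for one product instead of the whole file.
    return pd.read_hdf("bonds.h5", "bonds", where=f'isin == "{isin}"')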
Greetings, Georgi
So, collating some thoughts from the comments:

- Use mysqlclient instead of PyMySQL if you want more speed on the SQL side of the fence.
- Add an index to the SQL table on the columns actually used (isin for querying and date for ordering).
- Pass index_col="date" directly in read_sql() according to the docs; it might be faster.
- self.data[self.data["isin"] == isin] would be more performant than self.data.query("isin == @isin").

It helped a lot to transform the large data frame into a dictionary {isin -> DF} of smaller data frames, keyed by ISIN code. Data retrieval is much more efficient from a dictionary than from one big DF. Also, it is very natural to be able to request a single DF given an ISIN code. Hope this helps someone else.
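For reference, the dictionary approach boils down to something like this (a sketch, reusing the pickle file and column names from the question; the helper name is illustrative, not the production code):

import pandas as pd

df = pd.read_pickle("bonds_pickle.pickle")

# Split the big frame once at load time; every lookup afterwards is a
# plain dict access instead of a scan over the whole isin column.
data_by_isin = {
    isin: group.set_index("date").sort_index()
    for isin, group in df.groupby("isin")
}

def get_individual_bond_dynamic_data(isin):
    return data_by_isin[isin]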