I have a list of named tuples. Each named tuple is a DataPoint type I have created, which looks like this:
from typing import NamedTuple
from datetime import datetime

class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float
At various points throughout my code, I have to get all the DataPoints in the list that match a particular attribute value. Here's how I do it for analysis_date; I have similar functions for the other attributes:
def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date
This is called >100,000 times on lists with thousands of points, so it is slowing down my script significantly.
Instead of a list, I could use a dictionary for a significant speedup, but because I need to search on multiple attributes, there isn't an obvious key. I would probably choose the attribute whose lookup is taking up the most time (in this case, analysis_date) and use that as the key. However, this would add significant complexity to my code. Is there anything besides hashing / a clever way to hash that is escaping me?
You are right that you want to avoid doing what is essentially a linear search 100,000 times if the data can be pre-computed once. Why not use multiple dictionaries, each keyed by a different attribute of interest?
Each dictionary would be pre-computed once:
self.by_date = defaultdict(list)
for point in data_points:
    self.by_date[point.analysis_date].append(point)
Now your get_data_points_for_date function becomes a one-liner:

def get_data_points_for_date(self, date):
    return self.by_date[date]
You could probably remove this method entirely and just use self.by_date[date] instead.
This does not increase the complexity of your code, but it does transfer some of the book-keeping burden up front. You could handle that by having a set_data method that pre-computes all the dictionaries you want:
from collections import defaultdict
from operator import attrgetter

def set_data(self, data_points):
    def make_dict(key):
        d = defaultdict(list)
        for point in data_points:
            d[key(point)].append(point)
        return d

    self.by_date = make_dict(attrgetter('analysis_date'))
    self.by_zone = make_dict(self.zone_code)

def zone_code(self, data_point):
    return int(data_point.location_zone // 0.01)
Something like zone_code is necessary to convert floats to integers, since it is not a good idea to rely on floats as keys.
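For illustration, here is a minimal, runnable sketch of how these pieces could fit together. The DataStore class name and the sample points are assumptions made for the example, not part of the original question:

from collections import defaultdict
from datetime import datetime
from operator import attrgetter
from typing import NamedTuple

class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float

class DataStore:
    def __init__(self, data_points):
        self.set_data(data_points)

    def set_data(self, data_points):
        def make_dict(key):
            d = defaultdict(list)
            for point in data_points:
                d[key(point)].append(point)
            return d
        # Pre-compute one dictionary per attribute you need to search on.
        self.by_date = make_dict(attrgetter('analysis_date'))
        self.by_zone = make_dict(self.zone_code)

    def zone_code(self, data_point):
        # Bucket the float zone into an integer key.
        return int(data_point.location_zone // 0.01)

# Hypothetical sample data and lookups:
day = datetime(2019, 7, 1)
points = [
    DataPoint(data=0.5, location_zone=0.10, analysis_date=day, error=0.0),
    DataPoint(data=0.7, location_zone=0.25, analysis_date=day, error=0.1),
]
store = DataStore(points)
print(store.by_date[day])                          # both points
print(store.by_zone[store.zone_code(points[0])])   # points sharing the first point's zone bucket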
Perhaps an in-memory SQLite database (with column indexes) could help. It even has a way to map rows to named tuples, as Mapping result rows to namedtuple in python sqlite describes.
For a more complete solution, refer for example to http://peter-hoffmann.com/2010/python-sqlite-namedtuple-factory.html.
A basic example based on the two links above:
from typing import NamedTuple
from datetime import datetime
import sqlite3

class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float

def datapoint_factory(cursor, row):
    # Map each result row to a DataPoint named tuple.
    return DataPoint(*row)

def get_data_points_on_date(cursor, analysis_date):
    # Parameterized query; the index on analysis_date makes this a fast lookup.
    cursor.execute(
        "select * from datapoints where analysis_date = ?", (analysis_date,)
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
c.execute(
    "create index if not exists analysis_date_index on datapoints (analysis_date)"
)
timestamp = datetime.now().isoformat()
data_points = [
    DataPoint(data=0.5, location_zone=0.1, analysis_date=timestamp, error=0.0)
]
for data_point in data_points:
    c.execute("insert into datapoints values (?, ?, ?, ?)", data_point)
conn.commit()
c.close()
conn.row_factory = datapoint_factory
c = conn.cursor()
print(get_data_points_on_date(c, timestamp))
# [DataPoint(data=0.5, location_zone=0.1, analysis_date='2019-07-19T20:37:38.309668', error=0.0)]
c.close()
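As a hypothetical continuation of the snippet above, the same pattern extends to other attributes: add another index and query it with a parameterized statement. A range condition is used here to avoid exact equality comparisons on floats:

c = conn.cursor()
c.execute(
    "create index if not exists location_zone_index on datapoints (location_zone)"
)
c.execute(
    "select * from datapoints where location_zone between ? and ?",
    (0.05, 0.15),
)
print(c.fetchall())  # the single sample point above falls inside this range
c.close()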
NumPy and pandas are optimized for this kind of work and are extremely fast.
I did a simple comparison test in the code below to show how a pandas DataFrame dominates in speed:
code
import pandas as pd
import numpy as np
from time import perf_counter
# init
a = np.array([0 if 500 < i < 510 else 1 for i in range(100, 1000000)])
data_points = {'data': np.arange(100, 1000000),
               'location_zone': np.arange(100, 1000000),
               'analysis_date': np.arange(100, 1000000) * a,
               'error': np.arange(100, 1000000)}
df = pd.DataFrame(data_points)
# speed of dataframe
t0 = perf_counter()
b = df[df['analysis_date'] == 0]
print("pandas DataFrame took: {:.4f} sec".format(perf_counter() - t0))
print(b)
# speed normal python code
t0 = perf_counter()
indices = [d for d in range(data_points['analysis_date'].shape[0]) if data_points['analysis_date'][d] == 0]
print("normal python code took: {:.4f} sec".format(perf_counter() - t0))
print(indices)
output
pandas DataFrame took: 0.0049 sec
analysis_date data error location_zone
401 0 501 501 501
402 0 502 502 502
403 0 503 503 503
404 0 504 504 504
405 0 505 505 505
406 0 506 506 506
407 0 507 507 507
408 0 508 508 508
409 0 509 509 509
normal python code took: 0.2782 sec
[401, 402, 403, 404, 405, 406, 407, 408, 409]
pandas DataFrame reference: Link
a good tutorial on DataFrames: Link
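The benchmark above uses plain integer arrays; as a rough sketch (the data_points_list name is an assumption), the same idea applies to the DataPoint named tuples from the question: build the DataFrame once, then filter with a vectorized comparison instead of a Python loop:

import pandas as pd

# Build the frame once; column order follows the named tuple's fields.
df = pd.DataFrame(data_points_list, columns=DataPoint._fields)

def get_data_points_on_date(df, analysis_date):
    # The boolean mask is evaluated by pandas in C, not in a Python loop.
    matching = df[df["analysis_date"] == analysis_date]
    # Rebuild DataPoint tuples only for the rows that matched.
    return [DataPoint(*row) for row in matching.itertuples(index=False)]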
The following code:
def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date
can be refactored to:
def get_data_points_on_date(self, data_points, analysis_date):
    return (p for p in data_points if p.analysis_date == analysis_date)
You may consume the returned value in a for loop or make it a list with list(returned_value).
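For example, a small usage sketch (obj, some_date, and process are placeholders):

# Iterate lazily; nothing is materialized unless you ask for it.
for point in obj.get_data_points_on_date(data_points, some_date):
    process(point)

# Or materialize all matches at once.
points_on_date = list(obj.get_data_points_on_date(data_points, some_date))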
If you have a list of such DataPoints, you can make them accessible with O(1) lookup using pandas and a MultiIndex:
import pandas as pd

datapoints_series = pd.DataFrame(
    {
        "data": pt.data,
        "location_zone": pt.location_zone,
        "analysis_date": pt.analysis_date,
        "error": pt.error,
        "data_point": pt,
    }
    for pt in data_points_list
).set_index([
    "data",
    "location_zone",
    "analysis_date",
    "error",
]).squeeze().sort_index()  # send to Series; sort so the partial slicing below works
To access a particular date:
def date_accessor(date):
    idx = pd.IndexSlice[:, :, date, :]
    return idx

date = "2019-07-01"
datapoints_series.loc[date_accessor(date)]
If you want the datapoints in a list again, you can simply append a .tolist() method call to that last line.
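For example, assuming the series built above:

points_on_date = datapoints_series.loc[date_accessor("2019-07-01")].tolist()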