
Fastest way to search a list of named tuples?

I have a list of named tuples. Each named tuple is a DataPoint type I have created, that looks like this:

class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float

At various points throughout my code, I have to get all the DataPoints in the list that match a particular attribute value. Here's how I do it for analysis_date; I have similar functions for the other attributes:

def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date

This is called >100,000 times on lists with thousands of points, so it is slowing down my script significantly.

Instead of a list, I could use a dictionary for a significant speedup, but because I need to search on multiple attributes, there isn't an obvious key. I would probably pick the attribute whose lookup function takes the most time (in this case, analysis_date) and use that as the key. However, this would add significant complexity to my code. Is there anything besides hashing, or a clever way to hash, that is escaping me?

You are right that you want to avoid doing what is essentially a linear search 100,000 times if the data can be pre-computed once. Why not use multiple dictionaries, each keyed by a different attribute of interest?

Each dictionary would be pre-computed once:

self.by_date = defaultdict(list)
for point in data_points:
    self.by_date[point.analysis_date].append(point)

Now your get_data_points_on_date function becomes a one-liner:

def get_data_points_on_date(self, date):
    return self.by_date[date]

You could probably remove this method entirely, and just use self.by_date[date] instead.
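To get a rough feel for the difference, here is a minimal timing sketch. The data is a hypothetical stand-in (plain tuples with an integer date key rather than real DataPoints), and the actual numbers will depend on your machine and data:

```python
from collections import defaultdict
from timeit import timeit

# Hypothetical stand-in data: (value, date_key) pairs, 2,000 points over 50 dates
data_points = [(float(i), i % 50) for i in range(2000)]


def linear_search(date_key):
    # Scans every point on every call, like the loop in the question
    return [p for p in data_points if p[1] == date_key]


# Pre-compute the index once
by_date = defaultdict(list)
for p in data_points:
    by_date[p[1]].append(p)


def dict_lookup(date_key):
    # A single hash lookup per call
    return by_date[date_key]


linear_time = timeit(lambda: [linear_search(k) for k in range(50)], number=20)
dict_time = timeit(lambda: [dict_lookup(k) for k in range(50)], number=20)
print(f"linear: {linear_time:.4f} s, dict: {dict_time:.4f} s")
```

Both functions return the same points in the same order; only the cost per call differs.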

This does not increase the complexity of your code, but it does transfer some of the book-keeping burden up front. You could handle that by having a set_data method that pre-computes all the dictionaries you want:

from collections import defaultdict
from operator import attrgetter

def set_data(self, data_points):
    # Build one dict per attribute, grouping points by key(point)
    def make_dict(key):
        d = defaultdict(list)
        for point in data_points:
            d[key(point)].append(point)
        return d

    self.by_date = make_dict(attrgetter('analysis_date'))
    self.by_zone = make_dict(self.zone_code)

def zone_code(self, data_point):
    return int(data_point.location_zone // 0.01)

Something like zone_code is necessary to convert floats to integers, since relying on floats as dictionary keys is fragile: two values that should be equal can differ by rounding error and end up under different keys.
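Putting the pieces together, a self-contained sketch might look like the following. The DataStore class name and the sample points are assumptions made for illustration; they are not from the question:

```python
from collections import defaultdict
from datetime import datetime
from operator import attrgetter
from typing import NamedTuple


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


class DataStore:
    """Hypothetical container; the name is not from the original post."""

    def __init__(self, data_points):
        self.set_data(data_points)

    def set_data(self, data_points):
        def make_dict(key):
            d = defaultdict(list)
            for point in data_points:
                d[key(point)].append(point)
            return d

        self.by_date = make_dict(attrgetter("analysis_date"))
        self.by_zone = make_dict(self.zone_code)

    @staticmethod
    def zone_code(data_point):
        # Bucket the float into an integer so it is safe to use as a key
        return int(data_point.location_zone // 0.01)

    def get_data_points_on_date(self, analysis_date):
        # .get avoids inserting an empty list for dates that were never seen
        return self.by_date.get(analysis_date, [])


day1 = datetime(2019, 7, 1)
day2 = datetime(2019, 7, 2)
points = [
    DataPoint(0.5, 0.25, day1, 0.0),
    DataPoint(0.7, 0.25, day2, 0.1),
    DataPoint(0.9, 0.75, day1, 0.2),
]
store = DataStore(points)
on_day1 = store.get_data_points_on_date(day1)
in_zone = store.by_zone[DataStore.zone_code(points[0])]
```

Using .get in the accessor is a design choice: indexing a defaultdict with a missing key would silently add an empty list to the index.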

Perhaps an in-memory SQLite database (with column indexes) could help. It can even map result rows to named tuples, as described in the question "Mapping result rows to namedtuple in python sqlite".

For a more complete solution see, for example, http://peter-hoffmann.com/2010/python-sqlite-namedtuple-factory.html.


A basic example based on the two links above:

from typing import NamedTuple
from datetime import datetime
import sqlite3


class DataPoint(NamedTuple):
    data: float
    location_zone: float
    analysis_date: datetime
    error: float


def datapoint_factory(cursor, row):
    return DataPoint(*row)


def get_data_points_on_date(cursor, analysis_date):
    # Bind the value with a placeholder instead of formatting it into the SQL
    cursor.execute(
        "select * from datapoints where analysis_date = ?", (analysis_date,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
c.execute(
    "create index if not exists analysis_date_index on datapoints (analysis_date)"
)


# Dates are stored as ISO 8601 strings to match the text column
timestamp = datetime.now().isoformat()
data_points = [
    DataPoint(data=0.5, location_zone=0.1, analysis_date=timestamp, error=0.0)
]

for data_point in data_points:
    c.execute("insert into datapoints values (?, ?, ?, ?)", data_point)

conn.commit()
c.close()

# Rows fetched from cursors created after this point come back as DataPoint
conn.row_factory = datapoint_factory
c = conn.cursor()

print(get_data_points_on_date(c, timestamp))
# [DataPoint(data=0.5, location_zone=0.1, analysis_date='2019-07-19T20:37:38.309668', error=0.0)]
c.close()
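If you want to confirm that SQLite actually uses the index for the date lookup, a rough sketch like the one below can help. It reuses the table and index names from the example above, while the sample rows are made up; the exact wording of the query plan differs between SQLite versions, so the sketch only looks for the index name in the plan text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table datapoints "
    "(data real, location_zone real, analysis_date text, error real)"
)
conn.execute("create index analysis_date_index on datapoints (analysis_date)")

# Made-up sample rows for illustration
rows = [
    (0.5, 0.1, "2019-07-01T00:00:00", 0.0),
    (0.7, 0.2, "2019-07-02T00:00:00", 0.1),
    (0.9, 0.3, "2019-07-01T00:00:00", 0.2),
]
# executemany binds each tuple to the ? placeholders
conn.executemany("insert into datapoints values (?, ?, ?, ?)", rows)
conn.commit()

matches = conn.execute(
    "select * from datapoints where analysis_date = ? order by data",
    ("2019-07-01T00:00:00",),
).fetchall()

# The last column of each plan row is the human-readable description
plan = conn.execute(
    "explain query plan select * from datapoints where analysis_date = ?",
    ("2019-07-01T00:00:00",),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(matches)
print(plan_text)
```

executemany is also a convenient way to load thousands of points in one call instead of looping over execute.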

I strongly suggest using NumPy and pandas.

Both are optimized for this kind of vectorized filtering and are very fast.

I ran a simple comparison test, shown in the code below, to see how much faster a pandas DataFrame is:

Code:

import pandas as pd
import numpy as np
from time import perf_counter

# init
a = np.array([0 if 500 < i < 510 else 1 for i in range(100, 1000000)])
data_points = {'data': np.arange(100, 1000000),
        'location_zone': np.arange(100, 1000000),
        'analysis_date': np.arange(100, 1000000) * a,
        'error': np.arange(100, 1000000)}

df = pd.DataFrame(data_points)

# speed of dataframe
t0 = perf_counter()
b = df[df['analysis_date'] == 0]
print("pandas DataFrame took: {:.4f} sec".format(perf_counter() - t0))
print(b)

# speed normal python code
t0 = perf_counter()
indices = [d for d in range(data_points['analysis_date'].shape[0]) if data_points['analysis_date'][d] == 0]
print("normal python code took: {:.4f} sec".format(perf_counter() - t0))
print(indices)

Output:

pandas DataFrame took: 0.0049 sec
     analysis_date  data  error  location_zone
401              0   501    501            501
402              0   502    502            502
403              0   503    503            503
404              0   504    504            504
405              0   505    505            505
406              0   506    506            506
407              0   507    507            507
408              0   508    508            508
409              0   509    509            509

normal python code took: 0.2782 sec
[401, 402, 403, 404, 405, 406, 407, 408, 409]

pandas DataFrame reference: Link

a good tutorial on DataFrames: Link

The following code:

def get_data_points_on_date(self, data_points, analysis_date):
    data_on_date = []
    for data_point in data_points:
        if data_point.analysis_date == analysis_date:
            data_on_date.append(data_point)
    return data_on_date

can be refactored to:

def get_data_points_on_date(self, data_points, analysis_date):
    return (p for p in data_points if p.analysis_date == analysis_date)

You can iterate over the returned value in a for loop, or turn it into a list with list(returned_value).
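One thing to keep in mind with the generator version: it still scans the whole list each time it is consumed, and it can only be consumed once. A small sketch, using a hypothetical two-field Point as a stand-in for DataPoint:

```python
from collections import namedtuple

# Hypothetical stand-in for DataPoint with only the fields needed here
Point = namedtuple("Point", ["data", "analysis_date"])


def get_data_points_on_date(data_points, analysis_date):
    # Nothing is scanned until the generator is iterated
    return (p for p in data_points if p.analysis_date == analysis_date)


data_points = [
    Point(0.5, "2019-07-01"),
    Point(0.7, "2019-07-02"),
    Point(0.9, "2019-07-01"),
]

matches = get_data_points_on_date(data_points, "2019-07-01")
first_pass = list(matches)   # consumes the generator
second_pass = list(matches)  # the generator is already exhausted
```

If the same result is needed more than once, keep the list from the first pass rather than the generator itself.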

If you have a list of such DataPoints, you can index them for fast label-based lookup using pandas and a MultiIndex:

import pandas as pd

datapoints_series = pd.DataFrame(
    {
        "data": pt.data,
        "location_zone": pt.location_zone,
        "analysis_date": pt.analysis_date,
        "error": pt.error,
        "data_point": pt
    }
    for pt in data_points_list
).set_index([
    "data",
    "location_zone",
    "analysis_date",
    "error"
]).squeeze().sort_index()  # squeeze to a Series; sort the index so MultiIndex slicing works

To access a particular date:

def date_accessor(date):
    # Select every value on the other levels and a single analysis_date
    return pd.IndexSlice[:, :, date, :]

date = "2019-07-01"
datapoints_series.loc[date_accessor(date)]

If you want the datapoints in a list again, you can simply append a .tolist() method call to that last line.
