简体   繁体   中英

Python: How do I efficiently nest 4 list of dictionaries into one?

I have an MSSQL stored procedure which returns 4 selections to me: Entities , Certificates , Contacts and Logs . I need to combine these 4 selections in Pyton where a I put all Entities , Contacts and Logs under their Certificate . Each of these selections has an EntityId I can use for the merge.

The inputs are lists of simple, basic dataclasses containing the information from SQL. We convert these dataclasses into dictionaries inside the merging function.

When I originally wrote the code, I had no idea that the selections could be very large (100.000s of Certificates with all their other records). Unfortunately this made the code below very inefficient due to the many unnecessary iterations of the list comprehensions inside the loop. It can take up to 70 seconds. I am sure that there is a way to make this much faster. How do I improve the performance to being as efficient as possible?

from dataclasses import asdict

def cert_and_details(entities: List[Entity], 
                    certificates: List[Certificate], 
                    req_logs: List[DocumentRequestHistory], 
                    recipients: List[Recipient]) -> List[dict]:

    entities = [asdict(ent) for ent in entities] 
    certificates = [asdict(cert) for cert in certificates]
    req_logs = [asdict(log) for log in req_logs]
    recipients = [asdict(rec) for rec in recipients]

    results = []
    for cert_dict in certificates:

        cert_entity_id = cert_dict["entityid"]

        logs_under_cert = [log for log in req_logs if log["entityid"] == cert_entity_id]
        cert_dict["logs"] = logs_under_cert

        entities_under_cert = [ent for ent in entities if ent["entityid"] == cert_entity_id]
        cert_dict["linkedentity"] = entities_under_cert

        recipients_under_cert = [rec for rec in recipients if rec["entityid"] == cert_entity_id]
        cert_dict["recipients"] = recipients_under_cert

        results.append(cert_dict)

    return results

The main issue with the provided code is its computational complexity : it runs in O(C * (L + E + R)) where C is the number of certificates, L the number of logs, E the number of entities and R the number of recipients. This is fine if L+E+R is small, but if this is not the case, then the code will be slow.

You can write an implementation running in O(C + L + E + R) time. The idea is to build an index first to group logs/entities/recipients by entity ID . Here is a short example:

# Note: defaultdict should help to make this code smaller (and possibly faster)
logIndex = dict()
for log in req_logs:
    entityId = log["entityid"]
    if entityId in logIndex:
        logIndex[entityId].append(log)
    else:
        logIndex[entityId] = [log]

This code runs in (amortized) O(L) . You can then retrieve all the items in req_log with a given entity ID using just logIndex[entityId] .

There is another issue in the provided code: lists of dictionaries are inefficient : a dictionary indexing is slow and dictionaries are not memory efficient either. A better way to store and compute data could be to use dataframes (eg with Pandas which also provide a relatively optimized groupby function).

Below might also be another way to make the complexity order(2*C+L+E+R).

Caveat: I haven't tried running this, and it's just mock-up code that isn't trying to be as efficient as possible. I also just mocked it up thinking conceptually how to make it linear complexity, and it might have some fundamental 'Ooops' that I missed.

But it's based on the concept of looping through each of C, L, E and R's only once. This is done by first making certificates a dictionary, instead of list. The key is it's entityid . The lists to store each certificates logs, entities and recipients are also created at that time.

Then you can loop through L, E and R's only once, and directly add their entries to the certificates dictionary by looking up the entityid.

The final step (so why the 2*C in the complexity) would be to loop back through the certificates dictionary and turn it into a list to match the desired output type.

from dataclasses import asdict

def cert_and_details(entities: List[Entity], 
                    certificates: List[Certificate], 
                    req_logs: List[DocumentRequestHistory], 
                    recipients: List[Recipient]) -> List[dict]:

    certs = {}

    for cert in certificates:
        cert_dict = asdict(cert)
        cert_id = cert_dict['entityid']

        certs[cert_id] = cert_dict
        certs['logs'] = []
        certs['recipients'] = []
        certs['linkedentity'] = []

    for log in logs:
        log_dict = asdict(log)
        log_id = log_dict['entityid']
        certs[log_id]['logs'].append(log_dict)


    for ent in entities:
        ent_dict = asdict(ent)
        ent_id = ent_dict['entityid']
        certs[ent_id]['linkedentity'].append(ent_dict)

    for rec in recipients:
        rec_dict = asdict(rec)
        rec_id = rec_dict['entityid']
        certs[rec_id]['recipients'].append(rec_dict)

    # turn certs back into list, not dictionary
    certs = [cert for cert in certs.values()]

    return certs

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM