Pythonic way to find a dictionary that matches key, value pairs in another dictionary

I'm trying to find a way to match the key, value pairs of one dictionary against another. The first, record, is a list of dictionaries with a fixed set of keys (the values for each key can of course change). The second, potential_outputs, is user-defined and has variable keys and values: the user chooses which keys from the record they want to match on, assigns each a value, and then assigns an output value that is used when a match is found.

Example:

record = [
    {"Name": "John Smith", "Class": "c1", "Plan": "p1",},
    {"Name": "Jane Doe", "Class": "c2", "Plan": "p2",},
]
potential_outputs = [
    {"Class": "c1", "Plan": "p1", "Output": "o11"},
    {"Class": "c1", "Plan": "p2", "Output": "o12"},
    {"Class": "c2", "Plan": "p1", "Output": "o21"},
    {"Class": "c2", "Plan": "p2", "Output": "o22"},
]

The program needs to be able to loop through each dictionary in the record list, determine which dictionary in potential_outputs matches the key, value pairs, and then return the "Output" from the matching potential_outputs dictionary.

Expected output would be something along the lines of:

[
    {"Name": "John Smith", "Output": "o11"},
    {"Name": "Jane Doe", "Output": "o22"},
]

I also want to note that I am not committed to using dictionaries in order to resolve this issue.

Thank you!

You could group your outputs under a (Class, Plan) tuple key, then build the result dictionaries using a list comprehension.

Using an output lookup dictionary for O(1) lookups makes the solution O(N + M) instead of O(N * M), where N is the number of dictionaries in record and M is the number of dictionaries in potential_outputs.

record = [
    {"Name": "John Smith", "Class": "c1", "Plan": "p1",},
    {"Name": "Jane Doe", "Class": "c2", "Plan": "p2",},
]

potential_outputs = [
    {"Class": "c1", "Plan": "p1", "Output": "o11"},
    {"Class": "c1", "Plan": "p2", "Output": "o12"},
    {"Class": "c2", "Plan": "p1", "Output": "o21"},
    {"Class": "c2", "Plan": "p2", "Output": "o22"},
]

outputs = {(output["Class"], output["Plan"]): output["Output"] for output in potential_outputs}

result = [{"Name": r["Name"], "Output": outputs[r["Class"], r["Plan"]]} for r in record]

print(result)

Output:

[{'Name': 'John Smith', 'Output': 'o11'}, {'Name': 'Jane Doe', 'Output': 'o22'}]
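
One caveat with the comprehension above: `outputs[...]` raises a `KeyError` if a record has a (Class, Plan) pair that is missing from potential_outputs. Here is a minimal sketch of a variant that skips such records using `dict.get`. The helper name `match_outputs` and the extra "c3" record are made up for illustration, and skipping (rather than raising or returning a default) is an assumption about what you want.

```python
def match_outputs(record, potential_outputs):
    """Return Name/Output pairs, skipping records that have no matching rule."""
    # Build the (Class, Plan) -> Output lookup once: O(M).
    outputs = {(o["Class"], o["Plan"]): o["Output"] for o in potential_outputs}
    result = []
    for r in record:
        # .get returns None instead of raising KeyError when the pair is unknown.
        output = outputs.get((r["Class"], r["Plan"]))
        if output is not None:
            result.append({"Name": r["Name"], "Output": output})
    return result


record = [
    {"Name": "John Smith", "Class": "c1", "Plan": "p1"},
    {"Name": "Jane Doe", "Class": "c2", "Plan": "p2"},
    {"Name": "No Match", "Class": "c3", "Plan": "p1"},  # hypothetical unmatched record
]
potential_outputs = [
    {"Class": "c1", "Plan": "p1", "Output": "o11"},
    {"Class": "c2", "Plan": "p2", "Output": "o22"},
]
print(match_outputs(record, potential_outputs))
```

The overall cost stays O(N + M), since each record still does a single dictionary lookup.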

To avoid nested looping and M*N complexity, you can preprocess record:

from collections import defaultdict

rec = defaultdict(lambda: defaultdict(list))
for r in record:
    rec[r['Class']][r['Plan']].append(r['Name'])

before looping through the potential_outputs

result = [{"Name": name, "Output": po["Output"]} 
          for po in potential_outputs 
          for name in rec[po['Class']][po['Plan']]]
result
# [{'Name': 'John Smith', 'Output': 'o11'}, {'Name': 'Jane Doe', 'Output': 'o22'}]

It is possible to do this with a constant-time lookup per record, instead of scanning potential_outputs each time, by creating a third dictionary to be used as an index. The keys of the index dictionary should be sets of key/value pairs that can be valid identifiers for the desired output record. You can generate this index with frozensets containing (key, value) tuples, something like:


def make_index(data):
    result_index = {}
    for row in data:
        work_row = row.copy()
        work_row.pop("Output")
        while work_row:
            key = frozenset((key, value) for key, value in work_row.items())
            result_index.setdefault(key, []).append(row)
            work_row.pop(next(iter(work_row))) 
    return result_index


def search(index, row_key):
    row_key = row_key.copy()
    row_key.pop("Name", None)
    key = frozenset((key, value) for key, value in row_key.items())
    return index[key]

And this works if the rows in "potential_outputs" have all the keys of a record except "Name":

In [36]: index = make_index(potential_outputs)

In [37]: search(index, record[0])
Out[37]: [{'Class': 'c1', 'Plan': 'p1', 'Output': 'o11'}]

If you want matches that occur with fewer matching keys than just stripping "Name", the same index works, but the "search" code has to be changed. And then we have to know exactly which matches are desired in order to query accordingly. If "Class" and "Plan" match different records, should both be returned? Or None? You will likely find something in itertools to generate all the keys you want to search for, given a row in record.

Meanwhile, this code can already return multiple results when several rows match:


In [39]: search(index, {"Plan": "p2"})                                                                                                               
Out[39]: 
[{'Class': 'c1', 'Plan': 'p2', 'Output': 'o12'},
 {'Class': 'c2', 'Plan': 'p2', 'Output': 'o22'}]
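
For the partial-match case mentioned above, here is one possible sketch using `itertools.combinations`. It is not the `make_index`/`search` pair from this answer: it builds its own simpler index (one entry per rule, keyed by exactly that rule's pairs) and then tries subsets of the row's pairs from largest to smallest. The names `build_index` and `search_most_specific`, the sample `rules`, and the "most specific match wins" policy are all assumptions for illustration; the answer deliberately leaves that policy open.

```python
from itertools import combinations


def build_index(rules, output_key="Output"):
    """Map frozenset of (key, value) pairs -> list of rules with exactly those pairs."""
    index = {}
    for rule in rules:
        key = frozenset((k, v) for k, v in rule.items() if k != output_key)
        index.setdefault(key, []).append(rule)
    return index


def search_most_specific(index, row, ignore=("Name",)):
    """Try subsets of the row's pairs, largest first; return the first rules found."""
    pairs = [(k, v) for k, v in row.items() if k not in ignore]
    for size in range(len(pairs), 0, -1):
        matches = []
        for subset in combinations(pairs, size):
            matches.extend(index.get(frozenset(subset), []))
        if matches:
            # All rules of this size that matched; smaller subsets are not tried.
            return matches
    return []


# Hypothetical rules with differing key sets, for illustration only.
rules = [
    {"Class": "c1", "Plan": "p1", "Output": "o11"},
    {"Plan": "p2", "Output": "any-p2"},
]
index = build_index(rules)
print(search_most_specific(index, {"Name": "John Smith", "Class": "c1", "Plan": "p1"}))
print(search_most_specific(index, {"Name": "Jane Doe", "Class": "c2", "Plan": "p2"}))
```

Keep in mind that a row with k candidate keys has 2**k - 1 non-empty subsets, so this is only reasonable when the number of matchable keys is small.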

Here is a really simple way to handle it using pandas:

import pandas as pd

# Read your list of dicts into DataFrames.
dfr = pd.DataFrame(record)
dfp = pd.DataFrame(potential_outputs)

# Merge the two DataFrames on `Class` and `Plan` and return the result.
result = pd.merge(dfr, 
                  dfp, 
                  how='inner', 
                  on=['Class', 'Plan']).drop(['Class', 'Plan'], axis=1)

Output 1, as a DataFrame:

         Name Output
0  John Smith    o11
1    Jane Doe    o22

Output 2, as a list of dicts:

result2 = result.to_dict(orient="records")

[{'Name': 'John Smith', 'Output': 'o11'}, {'Name': 'Jane Doe', 'Output': 'o22'}]

If you made potential_outputs a dict of the form {("c1", "p1"): "o11"}, you could do this:

result = []
for a in record:
    if (a["Class"], a["Plan"]) in potential_outputs:
        result.append({"Name": a["Name"], "Output": potential_outputs[(a["Class"], a["Plan"])]})

That's maybe not the best way, but it is a pure Python way.

If you're interested in a one-liner:

result = [{"Name": r["Name"], "Output": o["Output"]} for r in record for o in potential_outputs if r["Class"] == o["Class"] and r["Plan"] == o["Plan"]]
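
The one-liner hard-codes Class and Plan. Since the question says the keys in potential_outputs are user-defined, here is a rough sketch that compares whatever keys each rule happens to have. The helper `rule_matches` and the second, Class-only rule are hypothetical, added just to show the idea. It is still O(N * M), like the one-liner.

```python
def rule_matches(rule, row, output_key="Output"):
    """True if every key/value pair in the rule (except the output) appears in the row."""
    return all(row.get(k) == v for k, v in rule.items() if k != output_key)


record = [
    {"Name": "John Smith", "Class": "c1", "Plan": "p1"},
    {"Name": "Jane Doe", "Class": "c2", "Plan": "p2"},
]
potential_outputs = [
    {"Class": "c1", "Plan": "p1", "Output": "o11"},
    {"Class": "c2", "Output": "o2x"},  # hypothetical rule that only constrains Class
]

result = [
    {"Name": r["Name"], "Output": o["Output"]}
    for r in record
    for o in potential_outputs
    if rule_matches(o, r)
]
print(result)
```

If several rules match the same record, this emits one entry per matching rule, so you would still need to decide which one should win.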

You could restructure your potential_outputs as a dictionary:

potential_output_dict = {
    f"{o['Class']}_{o['Plan']}": o['Output'] for o in potential_outputs
}

output = []
for r in record:
    plan_key = f"{r['Class']}_{r['Plan']}"
    # get() returns None when no rule matches this Class/Plan combination.
    matched_output = potential_output_dict.get(plan_key)
    if matched_output is None:
        continue

    output.append({
        "Name": r['Name'],
        "Output": matched_output,
    })

print(output)

This way you are using get(), which is a bit nicer than iterating over the list of dictionaries multiple times.

(code not tested)
