
Compare only certain dictionary key values within a list of dictionaries in Python

I have 300k+ individual dictionaries returned by API calls. Each API call returns one dict, so each of the dicts below is the result of one of a series of successful consecutive calls, and the code has to process each returned dict right after its call. The format is:

{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 21, 'Type': 'dwg', 'Size(MB)': 98, 'uid': 732854}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 8, 'Type': 'pdf', 'Size(MB)': 42, 'uid': 735554}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 16, 'Type': 'docx', 'Size(MB)': 104, 'uid': 746748}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 8, 'Type': 'pptx', 'Size(MB)': 57, 'uid': 731024}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Files': 8, 'Type': 'dwg', 'Size(MB)': 71, 'uid': 737328}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Files': 8, 'Type': 'docx', 'Size(MB)': 22, 'uid': 376494}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MIM', 'Files': 8, 'Type': 'pptx', 'Size(MB)': 28, 'uid': 687281}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MIM', 'Files': 8, 'Type': 'docx', 'Size(MB)': 20, 'uid': 687231}
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MET', 'Files': 20, 'Type': 'pptx', 'Size(MB)': 204, 'uid': 457281}

I have to append these individual dictionaries to a list of dictionaries, subject to the following conditions:

  1. dwg, pdf and bmp are the preferred 'Type' values.
  2. docx, pptx and xlsx are non-preferred and are only to be considered if none of the preferred formats is present.
  3. The name ('N.'), 'Batch', 'Sem', subject ('Sub'), 'Files' and 'Size(MB)' can take any value. All dicts for the same ['N.','Batch','Sem','Sub'] set are returned consecutively, one after the other, in their individual API calls.
  4. uid is a unique number for every individual dictionary and is never repeated.
  5. An entry is identified by its ['N.','Batch','Sem','Sub'] set. A dict with a non-preferred 'Type' should not make it into the final list if the final list already contains an entry for the same set (e.g. if an entry with docx/dwg/pdf/bmp already exists, pptx should not make it).
  6. There is no hierarchy among the preferred types or among the non-preferred types. For example, if an entry with pptx is present, another entry with docx should not make it.
  7. Initially the list is empty.

So out of the above data, only the following should make it to the final list:

[{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 21, 'Type': 'dwg', 'Size(MB)': 98, 'uid': 732854},
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 8, 'Type': 'pdf', 'Size(MB)': 42, 'uid': 735554},
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Files': 8, 'Type': 'dwg', 'Size(MB)': 71, 'uid': 737328},
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MIM', 'Files': 8, 'Type': 'pptx', 'Size(MB)': 28, 'uid': 687281},
{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MET', 'Files': 20, 'Type': 'pptx', 'Size(MB)': 204, 'uid': 457281},
...]

The code I tried:

list = []
dict = {'N.': name, 'Batch': year, 'Sem': semester, 'Sub': subject, 'Files': nofiles, 'Type': format, 'Size(MB)': size, 'uid': uniqueid}
comparekeys = ['N.','Batch','Sem','Sub']
nptype = ['docx', 'pptx', 'xlsx']
if dict not in list and format in nptype:
   for key in comparekeys:
      if dict[key] == (item[key] for item in list):
         break
list.append(dict)

The above code also appends the non-preferred formats and cannot look up whether an entry already exists in the list. I also tried zip(), set() and .keys(), but couldn't formulate the right code.
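One likely reason the lookup never succeeds is that `dict[key] == (item[key] for item in list)` compares a single value with a generator object, which is never equal. In addition, `break` only leaves the `for` loop, so `list.append(dict)` runs every time. The sketch below shows one way to test whether an entry already exists; `entry_exists`, `final`, `same_entry` and `new_entry` are hypothetical names used only for illustration.

```python
COMPARE_KEYS = ('N.', 'Batch', 'Sem', 'Sub')

def entry_exists(final, record):
    """Return True if `final` already holds a dict with the same key-set values."""
    return any(
        all(item[k] == record[k] for k in COMPARE_KEYS)
        for item in final
    )

final = [{'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Type': 'dwg'}]
same_entry = {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Type': 'pptx'}
new_entry = {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Type': 'pptx'}

# Comparing a value with a generator object is always False,
# which is why the original lookup never finds a match.
generator_comparison = 'Sam' == (item['N.'] for item in final)

print(generator_comparison)             # False
print(entry_exists(final, same_entry))  # True
print(entry_exists(final, new_entry))   # False
```

Note that this scans the whole list on every call, which may be slow with 300k+ dicts; keeping a set of key tuples would avoid the repeated scan.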

You said all dicts for the same ['N.','Batch','Sem','Sub'] set are returned consecutively, one after the other, in the API calls. So I'm going to presume they are grouped together.

Use itertools.groupby() to process each group of dicts. For each group, sort the dicts so that preferred types come before non-preferred types. The first of the sorted dicts is always added to the results, because either it has a preferred type or the group has no preferred types at all. Of the remaining sorted dicts, only those with a preferred type are appended to the results.
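The sorting step relies on two Python behaviours: `False` sorts before `True`, and `sorted()` is stable, so items with equal keys keep their original order. A small sketch on bare type strings (the `types` list is made up for illustration):

```python
preferred_types = ('dwg', 'pdf', 'bmp')

# Hypothetical mix of types, in arrival order.
types = ['docx', 'dwg', 'pptx', 'pdf']

# The key is False for preferred types and True for the others,
# so preferred types move to the front without being reordered among themselves.
ordered = sorted(types, key=lambda t: t not in preferred_types)

print(ordered)  # ['dwg', 'pdf', 'docx', 'pptx']
```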

import itertools as it

data = [
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 21, 'Type': 'dwg', 'Size(MB)': 98, 'uid': 732854},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 8, 'Type': 'pdf', 'Size(MB)': 42, 'uid': 735554},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 16, 'Type': 'docx', 'Size(MB)': 104, 'uid': 746748},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAD', 'Files': 8, 'Type': 'pptx', 'Size(MB)': 57, 'uid': 731024},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Files': 8, 'Type': 'dwg', 'Size(MB)': 71, 'uid': 737328},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'CAM', 'Files': 8, 'Type': 'docx', 'Size(MB)': 22, 'uid': 376494},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MIM', 'Files': 8, 'Type': 'pptx', 'Size(MB)': 28, 'uid': 687281},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MIM', 'Files': 8, 'Type': 'docx', 'Size(MB)': 20, 'uid': 687231},
    {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': 'MET', 'Files': 20, 'Type': 'pptx', 'Size(MB)': 204, 'uid': 457281},
]

preferred_types = ('dwg', 'pdf', 'bmp')

result = []

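# Group on the fields that identify an entry.
key = lambda v: (v['N.'], v['Batch'], v['Sem'], v['Sub'])

for _, values in it.groupby(data, key=key):
    # False sorts before True, so preferred types come first; the sort is stable.
    values = sorted(values, key=lambda v: v['Type'] not in preferred_types)

    # The first dict is either preferred, or the group has no preferred type at all.
    result.append(values[0])

    # Of the rest, keep only the preferred types.
    result.extend(value for value in values[1:] if value['Type'] in preferred_types)
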
for row in result:
    print(row)
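Since the question says the dicts arrive one per API call, the same rule can also be sketched as a generator that buffers only the current group and releases it when the key changes. This is a variant under the same assumption that dicts for one ['N.','Batch','Sem','Sub'] set arrive consecutively; `filter_records`, `select`, `rec` and the shortened sample records are hypothetical names and data, not part of the original code.

```python
PREFERRED = frozenset({'dwg', 'pdf', 'bmp'})
KEY_FIELDS = ('N.', 'Batch', 'Sem', 'Sub')

def select(group):
    """Keep all preferred dicts, or only the first dict if none is preferred."""
    preferred = [r for r in group if r['Type'] in PREFERRED]
    return preferred if preferred else group[:1]

def filter_records(records):
    """Yield the selected dicts from an iterable of consecutively grouped dicts."""
    current_key = None
    group = []
    for record in records:
        key = tuple(record[f] for f in KEY_FIELDS)
        if group and key != current_key:
            # The key changed, so the previous group is complete.
            yield from select(group)
            group = []
        current_key = key
        group.append(record)
    if group:
        # Flush the last group once the input is exhausted.
        yield from select(group)

def rec(sub, typ, uid):
    # Shortened sample record; only the fields used by the filter are included.
    return {'N.': 'Sam', 'Batch': 2019, 'Sem': 'I', 'Sub': sub, 'Type': typ, 'uid': uid}

stream = [
    rec('CAD', 'dwg', 1), rec('CAD', 'pdf', 2), rec('CAD', 'docx', 3),
    rec('CAM', 'docx', 4), rec('CAM', 'dwg', 5),
    rec('MIM', 'pptx', 6), rec('MIM', 'docx', 7),
]
kept_uids = [r['uid'] for r in filter_records(iter(stream))]
print(kept_uids)  # [1, 2, 5, 6]
```

Because a group is only released after it ends, a non-preferred dict that arrives before a preferred one for the same set is still dropped, which matches the groupby() version.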
