
Remove duplicates from the list of dictionaries (with a unique value)

I have a list of dictionaries, each describing a file (file format, filename, filesize, ... and a full path to the file, which is always unique). The goal is to exclude all but one of the dictionaries describing copies of the same file: I just want a single dict (entry) per file, no matter how many copies there are.

In other words: if 2 (or more) dicts differ only in a single key (i.e. path), leave only one of them.

For example, here is the source list:

src_list = [{'filename': 'abc', 'filetype': '.txt', ... 'path': 'C:/'},
            {'filename': 'abc', 'filetype': '.txt', ... 'path': 'C:/mydir'},
            {'filename': 'def', 'filetype': '.zip', ... 'path': 'C:/'},
            {'filename': 'def', 'filetype': '.zip', ... 'path': 'C:/mydir2'}]

The result should look like this:

dst_list = [{'filename': 'abc', 'filetype': '.txt', ... 'path': 'C:/'},
            {'filename': 'def', 'filetype': '.zip', ... 'path': 'C:/mydir2'}]

Use another dictionary to map each dictionary from the list, stripped of the "ignored" keys, to the actual dictionary. This way, only one of each kind is retained (the last one seen, because later entries overwrite earlier ones under the same key). Of course, dicts are not hashable, so you have to use tuples of (key, value) pairs, sorted by key, instead.

src_list = [{'filename': 'abc', 'filetype': '.txt', 'path': 'C:/'},
            {'filename': 'abc', 'filetype': '.txt', 'path': 'C:/mydir'},
            {'filename': 'def', 'filetype': '.zip', 'path': 'C:/'},
            {'filename': 'def', 'filetype': '.zip', 'path': 'C:/mydir2'}]
ignored_keys = ["path"]
filtered = {tuple((k, d[k]) for k in sorted(d) if k not in ignored_keys): d for d in src_list}
dst_lst = list(filtered.values())

Result is:

[{'path': 'C:/mydir', 'filetype': '.txt', 'filename': 'abc'}, 
 {'path': 'C:/mydir2', 'filetype': '.zip', 'filename': 'def'}]
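
If the first copy of each file should be kept rather than the last, the same mapping idea can be sketched with `dict.setdefault`. The function name `dedupe_keep_first` and its `ignored_keys` parameter are made up for this illustration; the sketch assumes Python 3.7+ (dicts keep insertion order) and hashable values.

```python
def dedupe_keep_first(dicts, ignored_keys=("path",)):
    """Return one dict per distinct file, keeping the first copy seen."""
    kept = {}
    for d in dicts:
        # Hashable, order-independent key built from everything except ignored_keys.
        key = tuple((k, d[k]) for k in sorted(d) if k not in ignored_keys)
        # setdefault stores d only if this key has not been seen yet.
        kept.setdefault(key, d)
    return list(kept.values())


src_list = [
    {"filename": "abc", "filetype": ".txt", "path": "C:/"},
    {"filename": "abc", "filetype": ".txt", "path": "C:/mydir"},
    {"filename": "def", "filetype": ".zip", "path": "C:/"},
    {"filename": "def", "filetype": ".zip", "path": "C:/mydir2"},
]
dst_list = dedupe_keep_first(src_list)
print([d["path"] for d in dst_list])  # ['C:/', 'C:/']
```

Which copy survives does not matter for the question as asked, so this is only a variation for cases where the choice is relevant.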

My own solution (maybe not the best, but it worked):

    dst_list = []
    seen_items = set()
    for dictionary in src_list:
        # temporarily cut the unique key (path) out; it is added back after the key is built
        path = dictionary.pop('path', None)
        # sort the items so the comparison does not depend on insertion order
        t = tuple(sorted(dictionary.items()))
        # restore the unique key on every dictionary, duplicates included,
        # so that src_list is left unchanged
        dictionary['path'] = path
        if t not in seen_items:
            seen_items.add(t)
            # duplicate check passed, keep this dictionary
            dst_list.append(dictionary)

    print(dst_list)

Where

src_list is the original list with possible duplicates,

dst_list is the final duplicate-free list,

path is the unique key
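
The loop above temporarily removes path from each dictionary while it runs. A variant that never touches the input could build the comparison key from a filtered view of the items instead. This is only a sketch: the names `iter_unique`, `unique_key` and `files` are invented for the example, and it assumes all values are hashable.

```python
def iter_unique(dicts, unique_key="path"):
    """Yield the first dict of each group that differs only in unique_key."""
    seen = set()
    for d in dicts:
        # frozenset ignores item order; values must be hashable for this to work.
        fingerprint = frozenset((k, v) for k, v in d.items() if k != unique_key)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield d


files = [
    {"filename": "abc", "filetype": ".txt", "path": "C:/"},
    {"filetype": ".txt", "filename": "abc", "path": "C:/mydir"},  # same file, keys in another order
    {"filename": "def", "filetype": ".zip", "path": "C:/mydir2"},
]
unique_files = list(iter_unique(files))
print(len(unique_files))  # 2
```

Because it is a generator, the result can be consumed lazily or collected with `list(...)` as shown.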
