
Detect and delete duplicates based on specific dictionary key in a list of dictionaries

I have a list of dictionaries, where each dictionary holds information about an article. Sometimes the same article "title" appears in more than one dictionary. I want to remove these duplicate dictionaries so that each article in the list is unique by title, i.e. no title repeats across the dictionaries.

I have

data = [{'title':'abc','source':'x','url':'abcx.com'},
        {'title':'abc','source':'y','url':'abcy.com'},
        {'title':'def','source':'g','url':'defg.com'}]

Expected result:

data = [{'title':'abc','source':'x','url':'abcx.com'},
        {'title':'def','source':'g','url':'defg.com'}]

A quick way is to keep track of the titles you have already seen:

titles_seen = set() #thank you @Mark Meyer
data = [{'title':'abc','source':'x','url':'abcx.com'},
        {'title':'abc','source':'y','url':'abcy.com'},
        {'title':'def','source':'g','url':'defg.com'}]
new_data = []
for item in data:
    if item['title'] not in titles_seen:
        new_data.append(item)
    titles_seen.add(item['title'])

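As @Mark Meyer points out in the comments, you could instead use the title as a dictionary key, which eliminates duplicates because dictionary keys are unique (the title is hashed). Alternatively, you could define an Entry class that hashes and compares by title, and then simply pass the list to frozenset (potentially overkill, and note that a frozenset is unordered, so the original list order is lost). The class definition follows the session output: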

>>> data
[<Entry title=abc source=x url=abcx.com />, <Entry title=abc source=y url=abcy.com />, <Entry title=def source=g url=defg.com />]
>>> frozenset(data)
frozenset({<Entry title=def source=g url=defg.com />, <Entry title=abc source=x url=abcx.com />})

class Entry:
    def __init__(self, title, source, url):
        self.title = title
        self.source = source
        self.url = url
    def __hash__(self):
        return hash(self.title)
    def __eq__(self, other):
        if isinstance(other, Entry):
            return self.title == other.title
        return False
    def __ne__(self, other):
        return (not self.__eq__(other))
    def __repr__(self):
        return "<Entry title={} source={} url={} />".format(self.title, self.source, self.url)
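The dictionary-keyed variant mentioned above needs no custom class. Here is a minimal sketch, assuming the first occurrence of each title should be kept (as in the expected result). The names `by_title` and `unique` are chosen here for illustration and do not come from the original answer:

```python
data = [
    {'title': 'abc', 'source': 'x', 'url': 'abcx.com'},
    {'title': 'abc', 'source': 'y', 'url': 'abcy.com'},
    {'title': 'def', 'source': 'g', 'url': 'defg.com'},
]

by_title = {}  # hypothetical name: maps each title to the first item carrying it
for item in data:
    # setdefault stores the item only if the title is not already a key,
    # so the first occurrence of each title is the one that is kept.
    by_title.setdefault(item['title'], item)

# Dictionaries preserve insertion order (Python 3.7+), so the original order survives.
unique = list(by_title.values())
```

A plain dict comprehension such as `{d['title']: d for d in data}` would also remove duplicates, but it keeps the last occurrence of each title rather than the first.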

But a better way is simply to check whether the title already exists before adding an item to the list in the first place.
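One way to do that check at insertion time is sketched below. It assumes the articles arrive one at a time as dictionaries with a 'title' key. `ArticleList` and its `add` method are hypothetical names introduced for this example, not part of the original answer:

```python
class ArticleList:
    """Hypothetical container that ignores articles whose title was already added."""

    def __init__(self):
        self.items = []       # unique articles, in the order they were added
        self._titles = set()  # titles seen so far, for O(1) membership checks

    def add(self, article):
        """Append the article unless its title is already present.

        Returns True if the article was added, False if it was a duplicate.
        """
        title = article['title']
        if title in self._titles:
            return False
        self._titles.add(title)
        self.items.append(article)
        return True


articles = ArticleList()
for article in [
    {'title': 'abc', 'source': 'x', 'url': 'abcx.com'},
    {'title': 'abc', 'source': 'y', 'url': 'abcy.com'},
    {'title': 'def', 'source': 'g', 'url': 'defg.com'},
]:
    articles.add(article)
```

After the loop, `articles.items` holds only the first article seen for each title, so no separate de-duplication pass is needed.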

Two lines with set:

tmp = set()
result = [tmp.add(i['title']) or i for i in data if i['title'] not in tmp]
