
Sort a list of dictionaries while consolidating duplicates in Python?

So I have a list of dictionaries like so:

data = [ { 
           'Organization' : '123 Solar',
           'Phone' : '444-444-4444',
           'Email' : '',
           'Website' : 'www.123solar.com'
         }, {
           'Organization' : '123 Solar',
           'Phone' : '',
           'Email' : 'joey@123solar.com',
           'Website' : 'www.123solar.com'
         }, {
           etc...
         } ]

Of course, this is not the exact data, but hopefully the example makes my problem clear. I have many records with the same "Organization" name, but not one of them has the complete information for that record.

Is there an efficient method for searching over the list, sorting the list based on the dictionary's first entry, and finally merging the data from duplicates to create a unique entry? (Keep in mind these dictionaries are quite large.)

You can make use of itertools.groupby:

from itertools import groupby
from operator import itemgetter
from pprint import pprint

data = [ {
           'Organization' : '123 Solar',
           'Phone' : '444-444-4444',
           'Email' : '',
           'Website' : 'www.123solar.com'
         }, {
           'Organization' : '123 Solar',
           'Phone' : '',
           'Email' : 'joey@123solar.com',
           'Website' : 'www.123solar.com'
         },
         {
           'Organization' : '234 test',
           'Phone' : '111',
           'Email' : 'a@123solar.com',
           'Website' : 'b.123solar.com'
         },
         {
           'Organization' : '234 test',
           'Phone' : '222',
           'Email' : 'ac@123solar.com',
           'Website' : 'bd.123solar.com'
         }]


# groupby only groups consecutive items, so sort by the same key first
data = sorted(data, key=itemgetter('Organization'))
result = {}
for key, group in groupby(data, key=itemgetter('Organization')):
    result[key] = list(group)

pprint(result)

prints:

{'123 Solar': [{'Email': '',
                'Organization': '123 Solar',
                'Phone': '444-444-4444',
                'Website': 'www.123solar.com'},
               {'Email': 'joey@123solar.com',
                'Organization': '123 Solar',
                'Phone': '',
                'Website': 'www.123solar.com'}],
 '234 test': [{'Email': 'a@123solar.com',
               'Organization': '234 test',
               'Phone': '111',
               'Website': 'b.123solar.com'},
              {'Email': 'ac@123solar.com',
               'Organization': '234 test',
               'Phone': '222',
               'Website': 'bd.123solar.com'}]}

UPD:

Here's what you can do to group the items into a single dict per organization:

for key, group in groupby(data, key=itemgetter('Organization')):
    result[key] = {'Phone': [],
                   'Email': [],
                   'Website': []}
    for item in group:
        result[key]['Phone'].append(item['Phone'])
        result[key]['Email'].append(item['Email'])
        result[key]['Website'].append(item['Website'])

Then, in result you'll have:

{'123 Solar': {'Email': ['', 'joey@123solar.com'],
               'Phone': ['444-444-4444', ''],
               'Website': ['www.123solar.com', 'www.123solar.com']},
 '234 test': {'Email': ['a@123solar.com', 'ac@123solar.com'],
              'Phone': ['111', '222'],
              'Website': ['b.123solar.com', 'bd.123solar.com']}}
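
If the goal is one consolidated record per organization rather than lists of values, the same groupby approach can be extended. The following is only a sketch under one assumed merge rule: for each field, keep the first non-empty value seen. The function name consolidate and the sample records are illustrative, not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter


def consolidate(records, key='Organization'):
    """Return one merged dict per distinct value of `key`, sorted by `key`."""
    merged = []
    # groupby only groups consecutive items, so sort by the same key first.
    ordered = sorted(records, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        combined = {}
        for record in group:
            for field, value in record.items():
                # Assumed rule: the first non-empty value for a field wins.
                if not combined.get(field):
                    combined[field] = value
        merged.append(combined)
    return merged


records = [
    {'Organization': '123 Solar', 'Phone': '444-444-4444',
     'Email': '', 'Website': 'www.123solar.com'},
    {'Organization': '123 Solar', 'Phone': '',
     'Email': 'joey@123solar.com', 'Website': 'www.123solar.com'},
]
result = consolidate(records)
```

With these two sample records, result holds a single dict that combines the phone number from the first record with the email address from the second.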

Is there an efficient method for searching over the list, sorting the list based on the dictionary's first entry, and finally merging the data from duplicates to create a unique entry?

Yes, but there's an even more efficient method without searching and sorting. Just build up a dictionary as you go along:

datadict = {}
for thingy in data:
    organization = thingy['Organization']
    datadict[organization] = merge(thingy, datadict.get(organization, {}))

Now you're making a single linear pass over the data, with a constant-time (on average) dictionary lookup for each record. So it beats any sort-based solution by a factor of O(log N). It's also one pass instead of several, and it will probably have lower constant overhead besides.

It's not clear exactly what you want to do to merge the entries, and nobody can write that code without knowing what rules you want to use. But here's a simple example:

def merge(d1, d2):
    for key, value in d2.items():
        if not d1.get(key):
            d1[key] = value
    return d1

In other words, for each item in d2, if d1 already has a truthy value (like a non-empty string), leave it alone; otherwise, add it.
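
Putting the two pieces together, a complete single-pass version might look like the sketch below. It makes two assumptions that differ from the snippets above: the accumulated entry is passed as d1, so the first non-empty value wins and the input records are not mutated, and the result is sorted at the end because the question asked for sorted output. The name consolidate_single_pass and the sample records are hypothetical.

```python
def merge(d1, d2):
    """Fill missing or empty fields of d1 from d2 and return d1."""
    for key, value in d2.items():
        if not d1.get(key):
            d1[key] = value
    return d1


def consolidate_single_pass(records, key='Organization'):
    datadict = {}
    for record in records:
        name = record[key]
        # Merge into the accumulated entry (a fresh dict the first time),
        # so the original input records are left untouched.
        datadict[name] = merge(datadict.get(name, {}), record)
    # Sorting only the unique entries is cheaper than sorting every record.
    return sorted(datadict.values(), key=lambda entry: entry[key])


records = [
    {'Organization': '234 test', 'Phone': '111',
     'Email': 'a@123solar.com', 'Website': 'b.123solar.com'},
    {'Organization': '123 Solar', 'Phone': '444-444-4444',
     'Email': '', 'Website': 'www.123solar.com'},
    {'Organization': '123 Solar', 'Phone': '',
     'Email': 'joey@123solar.com', 'Website': 'www.123solar.com'},
]
unique = consolidate_single_pass(records)
```

Swapping the arguments back to merge(record, existing) would instead let the most recent non-empty value win, at the cost of modifying the input records in place.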
