How to remove duplicates of huge lists of objects in Python
I have gigantic lists of objects with many duplicates (I'm talking thousands of lists with thousands of objects each, adding up to about 10 million individual objects, already without duplicates).
I need to go through them and remove all the duplicates inside each list (no need to compare between lists, only inside each one).
I can, of course, go through the lists and compare with any dedupe algorithm that has been posted many times, but I would guess this would take forever.
I thought I could create an object with a crafted __hash__ method and use list(set(obj)) to remove them, but first: I don't know if this would work; second: I would still have to loop through the lists to convert the elements to the new object.
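For what it's worth, the __hash__ idea does work, as long as __eq__ is defined consistently with it; a minimal sketch, with the class and field names invented for illustration:

```python
class Record:
    """Hashable wrapper so a set() can collapse duplicates."""
    def __init__(self, a, b, c):
        self.key = (a, b, c)

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        return self.key == other.key

items = [Record('aa', 2, False), Record('aa', 2, False), Record('bb', 4, True)]
unique = list(set(items))  # set() collapses the two equal records
print(len(unique))  # 2
```

Note that order is not preserved, and the original dicts still have to be wrapped, so this does not avoid the conversion loop the question mentions.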
I know Python is not the best solution for what I am trying to achieve, but in this case, it will have to be done in Python. I wonder what would be the best way to achieve this with the best performance possible.
Edit: for clarification: I have about 2k lists of objects, with about 5k objects inside each one (rough estimate). The duplicated objects are copies, not references to the same memory location. The lists (of dicts) are basically converted JSON arrays.
Edit 2: I'm sorry for not being clear, I will rephrase.
This is for a Django data migration, although my question only applies to the data 'formatting' and not the framework itself or database insertion. I inserted a whole bunch of data as JSON to a table for later analysis. Now I need to normalize it and save it correctly. I created new tables and need to migrate the data.
So when I retrieve the data from the db I have about 2000 JSON arrays. Applying json.loads(arr) (per the documentation) I get 2000 lists of objects (dicts). Each dict has only strings, numbers and booleans as values for each key, no nested objects/arrays, so something like this:
[
    {
        a: 'aa',
        b: 2,
        c: False,
        date: <date_as_long> // ex: 1471688210
    },
    {
        a: 'bb',
        b: 4,
        c: True,
        date: <date_as_long> // ex: 1471688210
    }
]
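(As an aside, real JSON uses quoted keys and lowercase true/false; json.loads maps an array of such objects onto exactly this list-of-dicts shape. Sample values below are invented:)

```python
import json

arr = '[{"a": "aa", "b": 2, "c": false, "date": 1471688210},' \
      ' {"a": "bb", "b": 4, "c": true, "date": 1471688210}]'
rows = json.loads(arr)  # one JSON array -> one list of dicts
print(rows[1]['c'])  # True
```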
What I need is to run through every list and remove duplicates. Something is considered a duplicate if all the fields except the date match within a list (this wasn't in the original question, as I hadn't predicted it). If they match across different lists, they are not considered duplicates.
After a better analysis of the contents, I found out I have close to 2 million individual records (not 10 million as said previously). The performance problems I face are because each dict needs to undergo some sort of data formatting (converting dates, for example) and be 'wrapped' in the model object for database insertion: ModelName(a='aaa', b=2, c=True, date=1471688210).
The insertion into the database itself is done by bulk_create.
NOTE: I'm sorry for the lack of clarification on the original question. The more I dug into this, the more I learned about what had to be done and how to handle the data.
I accepted @tuergeist's answer because it pointed to what I needed, even though my details were lacking.
Given that dicts cannot be hashed, and thus I can't add them to a set(), my solution was to create a set() of tuples for the duplicated data and verify the duplicates against it. This prevented an extra iteration when the duplicates were in a list.
So it was something like this:
data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

for my_list in data:
    for element in my_list:
        a = element['a']
        b = convert_whatever(element['b'])
        c = element['c']
        d = (a, b, c)  # Notice how only the fields that count for duplicate checking are here (not the date)
        if d not in duplicates:
            duplicates.add(d)
            normalized_data = {
                'a': a,
                'b': b,
                'c': c,
                'date': element['date'],
            }
            formatted_data.append(MyModel(**normalized_data))
    duplicates.clear()
After this, for better memory management, I used generators:
data = [lots of lists of dicts]
formatted_data = []
duplicates = set()

def format_element(el):
    a = el['a']
    b = convert_whatever(el['b'])
    c = el['c']
    d = (a, b, c)
    if d not in duplicates:
        duplicates.add(d)
        normalized_data = {
            'a': a,
            'b': b,
            'c': c,
            'date': el['date'],
        }
        formatted_data.append(MyModel(**normalized_data))

def iter_list(l):
    [format_element(x) for x in l]
    duplicates.clear()

[iter_list(my_list) for my_list in data]
Working code here: http://codepad.org/frHJQaLu
NOTE: My finished code is a little different (and in a functional style) from this one. This serves only as an example of how I solved the problem.
Edit 3: For the database insertion I used bulk_create. In the end it took 1 minute to format everything correctly (1.5 million unique entries, 225k duplicates) and 2 minutes to insert everything into the database.
Thank you all!
I'd suggest keeping a sorted list (if possible), so you can compare items (dicts, I mean) more precisely. A hashed (or unhashed) list can fulfill that. If you have the ability to manage the "add and delete" operations on your lists, even better: sort the new items each time you add/delete. (Good if you have a hash list; forget it if you have a linked list.) Complexity will of course depend on your structure (FIFO/FILO list, linked list, hash, ...).
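The "sort on every add" idea can be sketched with the standard-library bisect module (key values here are invented); the membership lookup is O(log n), though the list insert itself is still O(n):

```python
import bisect

sorted_keys = []  # kept sorted at all times

def add_if_new(key):
    """Insert key at its sorted position unless it is already present."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return False  # duplicate, skip
    sorted_keys.insert(i, key)
    return True

add_if_new(('bb', 4))
add_if_new(('aa', 2))
add_if_new(('aa', 2))  # rejected as a duplicate
print(sorted_keys)  # [('aa', 2), ('bb', 4)]
```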
A fast, not order-preserving solution for hashable items is:
def unify(seq):
    # Not order preserving
    return list(set(seq))
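Note this only works for hashable items; calling it on dicts raises TypeError, which is why the question ended up using tuples as set keys:

```python
def unify(seq):
    # Not order preserving
    return list(set(seq))

print(sorted(unify([3, 1, 2, 1, 3])))  # [1, 2, 3]

try:
    unify([{'a': 1}])
except TypeError:
    print('dicts are unhashable')
```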
Complete Edit
I assume that you have dicts inside a list, and that you have many lists. The solution to remove duplicates from a single list is:
def remove_dupes(mylist):
    # Works for unhashable items such as dicts; O(n**2) overall
    newlist = []
    for e in mylist:
        if e not in newlist:
            newlist.append(e)
    return newlist
A list here contains the following dicts (but all random):
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter","lastName":"Jones"}
Running this, it took 8 s for 2000 dicts on my MacBook (2.4 GHz, i5).
Complete code: http://pastebin.com/NSKuuxUe
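Since remove_dupes rescans newlist for every element (quadratic overall), a variant that tracks seen dicts in a set of frozenset(d.items()) keys does the same job in roughly linear time, assuming all dict values are hashable:

```python
def remove_dupes_fast(mylist):
    """Order-preserving dedup of dicts, roughly O(n) via hashable keys."""
    seen = set()
    newlist = []
    for d in mylist:
        key = frozenset(d.items())  # hashable stand-in for the dict
        if key not in seen:
            seen.add(key)
            newlist.append(d)
    return newlist

people = [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "John", "lastName": "Doe"},
]
print(len(remove_dupes_fast(people)))  # 2
```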
Here is a solution for sorted lists:
class Solution:
    def removeDuplicates(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        if len(nums) == 0:
            return 0
        j = 0
        for i in range(len(nums)):
            if nums[i] != nums[j]:
                j = j + 1
                nums[j] = nums[i]
        return j + 1